### Statistical methods and notation in SEM

The statistical theory involved in SEM is extensive, so we present only the general concepts and notation needed to follow this work. SEM comprises two general sub-models: 1) a *measurement model* that specifies the relationships between the observed variables (indicators) and the latent (unobserved) variables; and 2) a *structural model* that specifies the relationships among the latent variables. The general form of the *measurement model* is as follows [1]:

**x** = **Λ**_{x}**ξ** + **δ**

**y** = **Λ**_{y}**η** + **ε**,

where

**x** = q × 1 is a vector of observed indicators of the exogenous latent variables (**ξ**);

**y** = p × 1 is a vector of observed indicators of the endogenous latent variables (**η**);

**ξ** = n × 1 is a vector of exogenous (independent) latent random variables;

**η** = m × 1 is a vector of endogenous (dependent) latent random variables;

**δ** = q × 1 is a vector of measurement errors for **x**;

**ε** = p × 1 is a vector of measurement errors for **y**;

**Λ**_{x} = q × n is a matrix of coefficients relating **x** to **ξ**; and,

**Λ**_{y} = p × m is a matrix of coefficients relating **y** to **η**.
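As a minimal sketch of the measurement model for **x**, the following Python snippet builds the indicators from a loading matrix, a latent vector, and an error vector. The dimensions, loading values, and error scale are hypothetical and chosen only to make the matrix shapes concrete; they are not taken from the analysis reported here.

```python
import numpy as np

rng = np.random.default_rng(0)

q, n = 4, 2  # hypothetical sizes: 4 indicators, 2 exogenous latent variables

# Lambda_x (q x n): in this illustration each indicator loads on one latent variable.
Lambda_x = np.array([
    [1.0, 0.0],
    [0.8, 0.0],
    [0.0, 1.0],
    [0.0, 0.7],
])

xi = rng.normal(size=(n, 1))                 # latent exogenous variables, n x 1
delta = rng.normal(scale=0.3, size=(q, 1))   # measurement errors, q x 1

# Measurement model: x = Lambda_x xi + delta
x = Lambda_x @ xi + delta                    # observed indicators, q x 1
```

The same pattern applies to **y** with a p × m loading matrix and the m × 1 vector **η**.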

The general form of the *structural model* is as follows [1]:

**η** = **B η** + **Γ ξ** + **ζ**,

where

**B** = m × m is a matrix of path coefficients for latent endogenous variables (**η**);

**Γ** = m × n is a matrix of path coefficients for latent exogenous variables (**ξ**); and,

**ζ** = m × 1 is a vector that represents the errors or disturbances in **η**.
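Because **η** appears on both sides of the structural equation, it can help to see the reduced form, **η** = (**I** - **B**)^{-1}(**Γ ξ** + **ζ**), which holds whenever **I** - **B** is nonsingular. The sketch below solves for **η** numerically; all coefficient values are hypothetical.

```python
import numpy as np

m = 2  # hypothetical number of endogenous latent variables

# B (m x m): eta_1 affects eta_2 with coefficient 0.5 (illustrative values).
B = np.array([[0.0, 0.0],
              [0.5, 0.0]])
# Gamma (m x n): effects of the exogenous latent variables on eta.
Gamma = np.array([[0.4, 0.0],
                  [0.0, 0.3]])

xi = np.array([[1.0], [2.0]])      # exogenous latent values, n x 1
zeta = np.array([[0.1], [-0.2]])   # disturbances, m x 1

# Reduced form: solve (I - B) eta = Gamma xi + zeta for eta.
eta = np.linalg.solve(np.eye(m) - B, Gamma @ xi + zeta)
```

Substituting the result back into **η** = **B η** + **Γ ξ** + **ζ** reproduces the same vector, which is a quick check that the reduced form is consistent with the structural equation.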

To simplify notation, we refer to λ_{xi} and λ_{yi} values collectively as λ_{i} values or measurement model "loadings" and β_{j} and γ_{j} values as β_{j} values or structural model "path coefficients".

In SEM, the null hypothesis states that, if the conceptualized model is correct, the population covariance matrix of the observed variables, **Σ**, is exactly reproduced by the covariance matrix implied by the model parameters, **Σ**(**θ**) (H_{0}: **Σ** = **Σ**(**θ**)). Thus, covariance-based SEM aims to test "causal" model theory by minimizing the difference between the sample covariance matrix (**S**) and the model-implied covariance matrix (**Σ**(**θ**)) using a fitting function. Maximum likelihood (ML) fitting functions (F_{ML}) such as the following, where *p* denotes the total number of observed variables (all **x** and **y** indicators combined), are often used for global optimization but require the rigid assumptions of multivariate normality and independence of observations [1]:

**F**_{ML} = log|**Σ**(**θ**)| + tr[**S Σ**(**θ**)^{-1}] - log|**S**| - (p).
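To make the fitting function concrete, here is a minimal Python sketch that evaluates F_{ML} for a given sample covariance matrix and model-implied covariance matrix. The function name `f_ml` and the example matrices are illustrative assumptions; in practice an optimizer would search over **θ** to minimize this quantity.

```python
import numpy as np

def f_ml(S, Sigma):
    """Evaluate log|Sigma| + tr(S Sigma^-1) - log|S| - p for covariance matrices."""
    p = S.shape[0]  # total number of observed variables
    _, logdet_Sigma = np.linalg.slogdet(Sigma)
    _, logdet_S = np.linalg.slogdet(S)
    trace_term = np.trace(S @ np.linalg.inv(Sigma))
    return logdet_Sigma + trace_term - logdet_S - p

# Hypothetical 2 x 2 example: sample covariance vs. a model implying no covariance.
S = np.array([[1.0, 0.5],
              [0.5, 1.0]])
Sigma_model = np.eye(2)

fit_perfect = f_ml(S, S)            # zero: the model reproduces S exactly
fit_misfit = f_ml(S, Sigma_model)   # positive: the model ignores the covariance
```

The function is zero when **Σ**(**θ**) equals **S** and grows as the two matrices diverge, which is why its minimized value serves as the basis for the model chi-square test.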

Thus, for models with non-normal variables, a weighted least squares (WLS) fitting function (F_{WLS}) should be used to obtain unbiased estimates, standard errors and model tests [1, 2]:

**F**_{WLS} = [**ρ** - **σ**(**θ**)]' **W**^{-1} [**ρ** - **σ**(**θ**)],

where:

**W**^{-1} is the weight matrix for the residuals;

**ρ** is the vector of elements containing polychoric, tetrachoric, and polyserial correlations;

**σ**(**θ**) is the corresponding vector from the same-order implied matrix **Σ**(**θ**); and,

**θ** is the t × 1 vector of free parameters.
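The WLS fitting function is a weighted quadratic form in the residuals between the sample correlations and their model-implied counterparts. The sketch below evaluates it for hypothetical vectors; the function name `f_wls`, the example values, and the identity weight matrix are assumptions made for illustration only.

```python
import numpy as np

def f_wls(rho, sigma, W):
    """Evaluate (rho - sigma)' W^-1 (rho - sigma)."""
    resid = rho - sigma
    # Solve W z = resid rather than forming W^-1 explicitly.
    return float(resid @ np.linalg.solve(W, resid))

# Hypothetical sample correlations and model-implied values.
rho = np.array([0.5, 0.3, 0.2])
sigma_theta = np.array([0.4, 0.3, 0.1])
W = np.eye(3)  # identity weights reduce F_WLS to a plain sum of squared residuals

fit = f_wls(rho, sigma_theta, W)
```

With a non-identity **W**, residuals for less precisely estimated correlations receive less weight in the quadratic form.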

Variations to these fitting functions have been devised, including the robust weighted least-squares estimator (WLSMV), which allows for estimation of binary and categorical dependent variables [2, 3], and a ML estimator robust to non-normality (MLF) [4].

### Data preparation and modeling procedures

First, we randomly selected a training (Replicate 64) and a validation (Replicate 46) data set from the 100 simulated replicates. In an attempt to satisfy the requirement for independence of observations, the data files were reconstructed by randomly selecting one case from each affected sib pair (*n*_{1} = 1500) and including all unrelated controls (*n*_{2} = 2000). The SNPs were coded assuming an additive genetic model (e.g., 1/1 = 0; 1/2 = 1; 2/2 = 2, where 1 = wild-type allele and 2 = variant allele). Gender and smoking were dichotomous (0 = males; 1 = females; non-smokers = 0; smokers = 1). Because IgM and anti-CCP (anti-cyclic citrullinated peptide) values were provided only for cases, we arbitrarily set the values of these variables to zero for controls.

Using the locations of the simulated risk loci provided in the answer key, we selected genotyped SNPs from the 10 K and chromosome 6 dense SNP chip panels upstream of, downstream of, and directly at (when available) the known location of each locus to build latent constructs for each gene. Because we did not explicitly know which SNPs were representative of each simulated locus, we also examined the linkage disequilibrium (LD) structure (Haploview v3.2) among the selected SNPs and performed factor analysis (FA) using SAS v8.2 (SAS Institute, Inc.) to help devise viable gene constructs. FA was performed independently of disease status (with only the SNP data) and was used to generate eigenvalues, inspect scree plots and factor patterns, and determine the proportion of average variance explained (AVE). AVE is an indicator of the communality or validity of the construct [5]. We also inspected LD and Pearson correlations to help confirm initial SNP selections, particularly when FA suggested that the construct was less than valid (AVE < 0.50) [5] or that more than one factor was emerging.
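The additive coding and a rough eigenvalue-based check can be sketched as follows. The helper names `code_additive` and `first_factor_proportion` are hypothetical, and the eigenvalue ratio shown is only a simple stand-in for the variance-explained summary; it is not the SAS factor analysis procedure used in this work.

```python
import numpy as np

def code_additive(genotype, variant="2"):
    """Count copies of the variant allele in a genotype string such as '1/2'."""
    return genotype.split("/").count(variant)

genotypes = ["1/1", "1/2", "2/2"]
coded = [code_additive(g) for g in genotypes]

def first_factor_proportion(X):
    """Share of total variance carried by the largest eigenvalue of the
    correlation matrix of X (rows = subjects, columns = coded SNPs)."""
    R = np.corrcoef(X, rowvar=False)
    eigvals = np.linalg.eigvalsh(R)  # eigenvalues in ascending order
    return eigvals[-1] / R.shape[0]

# Hypothetical coded genotypes for two SNPs in complete LD (identical columns).
X = np.array([[0, 0], [1, 1], [2, 2], [1, 1]])
prop = first_factor_proportion(X)
```

Two SNPs in complete LD give a proportion of 1.0, whereas uncorrelated SNPs would give a value near 1 divided by the number of SNPs, suggesting that a single construct is not well supported.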

We then built the full model(s) by constructing measurement and structural model equations using the "causal" RA model information provided in the answer key. To obtain scale determinacy and model identification, one of the loadings in each construct (e.g., the SNP with the highest loading from the FA or, ideally, the SNP with the largest biological impact on gene function) must be fixed to 1.0. When we were able to locate a SNP at the exact physical location of the simulated locus, we fixed that SNP's loading to 1.0. When the simulated locus fell between two SNPs, we arbitrarily selected one of them. We analyzed the models using the WLSMV estimator in Mplus v4.1 (StatModel, Inc.) and evaluated the overall fit.
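The need to fix one loading follows from the fact that a latent variable has no intrinsic scale: dividing all loadings by a constant and multiplying the latent variable by the same constant leaves the indicators unchanged. The short sketch below illustrates this with hypothetical loadings and a hypothetical latent score.

```python
import numpy as np

loadings = np.array([0.5, 0.8, 0.4])  # hypothetical unscaled loadings for one gene construct
xi = 2.0                              # hypothetical latent score for one subject
ref = 1                               # index of the SNP chosen as the reference indicator

scaled_loadings = loadings / loadings[ref]  # reference loading becomes exactly 1.0
scaled_xi = xi * loadings[ref]              # the latent variable absorbs the scale

# The systematic part of each indicator is identical under both scalings.
same_fit = np.allclose(loadings * xi, scaled_loadings * scaled_xi)
```

Fixing the reference loading to 1.0 picks one member of this family of equivalent solutions, so the remaining loadings are interpreted relative to the reference SNP.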

The chi-square test evaluates whether the specified model differs significantly from the alternative model, which assumes that the data come from a multivariate normal distribution with an unconstrained covariance matrix. However, the test is affected by departures from normality and by sample size (with large samples, it has substantially more power to falsely reject an acceptable model). Therefore, when categorical or non-normal dependent variables are modeled, a modified chi-square test [6] or another fit index robust to non-normality is needed. Although a plethora of alternative goodness-of-fit indices exists, we chose to evaluate only the following three. The root mean square error of approximation (RMSEA) is an absolute fit index that expresses the data-to-model discrepancy per degree of freedom; an RMSEA value less than or equal to 0.05 is believed to represent the boundary of acceptable fit [7]. The Comparative Fit Index (CFI) is an incremental fit index that is independent of sample size; values exceeding 0.96 indicate acceptable model fit [8]. The weighted root mean square residual (WRMR) is a relatively new fit index that is believed to be better suited to categorical data; WRMR values less than 1.0 indicate a good-fitting model [7]. We also evaluated the coefficients and their standard errors.
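For orientation, the RMSEA and CFI can be computed from chi-square statistics using their commonly cited textbook definitions, sketched below. The function names and input values are hypothetical, and software packages differ in details (for example, some use N rather than N - 1 in the RMSEA denominator), so these formulas should not be read as the exact computations performed by Mplus.

```python
import math

def rmsea(chi2, df, n):
    """Textbook RMSEA: sqrt(max(chi2 - df, 0) / (df * (n - 1)))."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

def cfi(chi2_model, df_model, chi2_base, df_base):
    """Textbook CFI comparing the fitted model with a baseline (independence) model."""
    d_model = max(chi2_model - df_model, 0.0)
    d_base = max(chi2_base - df_base, d_model)
    return 1.0 if d_base == 0 else 1.0 - d_model / d_base

# Hypothetical statistics for a sample of 3500 subjects.
r = rmsea(chi2=50.0, df=40, n=3500)
c = cfi(chi2_model=50.0, df_model=40, chi2_base=1000.0, df_base=55)

acceptable = (r <= 0.05) and (c > 0.96)  # cutoffs quoted in the text above
```

Both indices work from the excess of the chi-square over its degrees of freedom, which is why a model whose chi-square is at or below its degrees of freedom yields an RMSEA of 0 and a CFI of 1.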