In this section, we introduce the four methods of ethnicity assignment investigated in this study and the datasets used to evaluate their empirical performance. We begin by briefly introducing principal component analysis (PCA), a dimensionality reduction technique used as a preprocessing step for three of the four methods. We then describe the four classification algorithms – support vector machines (SVM), linear discriminant analysis (LDA), quadratic discriminant analysis (QDA) and 1-nearest neighbor (1NN). Finally, we describe the datasets used for evaluation, the conversion of mtDNA sequence profiles into feature vectors, and methods of encoding sequences with missing regions.

### Principal component analysis

PCA (see [6] for an introduction) is a factor analysis technique for dimensionality reduction. Given *m* samples over *n* variables, the *m* samples can be represented as an *m* × *n* matrix **X**. We further assume that the sample mean of each variable is 0, that is, Σ_{i=1}^{m} *x*_{ij} = 0 for every *j*. Projecting the *m* samples onto *n* new axes yields another *m* × *n* matrix **Y** = **XP**, where **P** is an *n* × *n* orthogonal matrix whose columns are unit vectors defining the *n* new axes. PCA finds a **P** such that the sample covariance matrix of the *n* new variables is a diagonal matrix, that is,

**Σ**_{Y} = **P**^{T}**Σ**_{X}**P** = **D**,     (1)

where **D** is a diagonal matrix, and **Σ**_{X} and **Σ**_{Y} are the sample covariance matrices of the original and new variables, respectively. The orthogonal matrix **P** can be easily obtained by eigenvalue decomposition of **Σ**_{X}. PCA is a dimensionality reduction technique in that only *k* of the *n* new variables are kept for further analysis. A standard approach is to pick the *k* variables with the largest sample variances, so the only remaining choice is the value of *k*. Fortunately, when PCA is used in conjunction with supervised learning algorithms such as classification algorithms, the best value of *k* can be selected by cross-validation. In this study, *k* was selected by performing 5-fold cross-validation (CV) on the training data for each combination of dataset and classification algorithm.

### Classification algorithms

#### Support vector machines

The SVM [12] is a binary classification algorithm. In the case of perfectly separable classes, SVM seeks a separating hyperplane with maximum margin, while for non-separable classes the goal is to optimize a linear combination of the separation margin and the total amount by which SVM predictions fall on the wrong side of their margin. Given *n*-element feature vectors **x**_{i}, *i* = 1,…,*m*, and an *m*-element label vector **y** such that *y*_{i} ∈ {1, –1}, this amounts to solving the following optimization problem:

min_{β, β_{0}, ξ} ½‖β‖^{2} + *C* Σ_{i=1}^{m} *ξ*_{i}
subject to *y*_{i}(β^{T}*Φ*(**x**_{i}) + *β*_{0}) ≥ 1 – *ξ*_{i}, *ξ*_{i} ≥ 0, *i* = 1,…,*m*     (2)

where *C* > 0 is a penalty constant, *ξ*_{i} is the slack variable allowing misclassification of sample *i*, *Φ*(⋅) is a function that maps **x**_{i} to a high-dimensional space, often called the feature space, and β, *β*_{0} define the optimum separating hyperplane β^{T}**z** + *β*_{0} = 0 in feature space. Once the optimal separating hyperplane is found, a test sample **t** is classified according to the sign of β^{T}*Φ*(**t**) + *β*_{0}.

In practice, the solution to the convex optimization problem (2) is obtained by solving the so-called Wolfe dual. Instead of explicitly mapping samples to the feature space, solving the dual requires only a kernel function K(**x**_{1},**x**_{2}) = *Φ*(**x**_{1})^{T}*Φ*(**x**_{2}), which implicitly maps samples to the feature space and simultaneously computes the inner product [12]. In this study, we used the software package LIBSVM [13] to conduct all SVM experiments. LIBSVM uses the “one-against-one” approach [14] when more than two classes are present. For all SVM experiments we used the radial basis kernel K(**x**_{1},**x**_{2}) = exp(–*γ*‖**x**_{1} – **x**_{2}‖^{2}), where *γ* is a parameter. The penalty constant *C* and the parameter *γ* were tuned using 5-fold cross-validation on the training data.
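The tuning procedure can be sketched as follows, assuming scikit-learn's `SVC` (which is itself built on LIBSVM) in place of the LIBSVM command-line tools; the parameter grid and toy data are illustrative, not those used in the study.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Toy binary problem standing in for the mtDNA feature vectors.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

# RBF-kernel SVM; C and gamma selected by 5-fold CV on the training data,
# mirroring the tuning step described above.
grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), grid, cv=5)
search.fit(X, y)
best_C = search.best_params_["C"]
best_gamma = search.best_params_["gamma"]
```

For more than two classes, `SVC` applies the same "one-against-one" strategy mentioned above.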

#### Linear and quadratic discriminant analysis

LDA and QDA assume that for each class the feature vectors follow a multivariate normal distribution [6]. That is, the conditional probability of a sample **x** given that it belongs to class *g* is

Pr(**X** = **x** | *G* = *g*) = (2π)^{–n/2} |**Σ**_{g}|^{–1/2} exp(–½(**x** – **μ**_{g})^{T}**Σ**_{g}^{–1}(**x** – **μ**_{g}))     (3)

By applying Bayes’ theorem, we obtain the posterior distribution as follows:

Pr(*G* = *g* | **X** = **x**) = *π*_{g} Pr(**X** = **x** | *G* = *g*) / Σ_{g′} *π*_{g′} Pr(**X** = **x** | *G* = *g′*)     (4)

where *π*_{g} is the prior probability of class *g*. The parameters of the multivariate normal distribution are estimated from the training dataset. LDA assumes that the classes share a common covariance matrix (i.e., **Σ**_{g} = **Σ** for every *g*); therefore, fewer parameters need to be estimated for LDA than for QDA. For both methods, a given test sample **t** is assigned to the class with the highest posterior probability, argmax_{g} Pr(*G* = *g* | **X** = **t**).

In this study, we used MCLUST Version 3 [15] to conduct all LDA and QDA experiments.
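The QDA decision rule above can be sketched directly in numpy: estimate a mean, covariance matrix, and prior per class, then assign a test sample to the class maximizing the posterior. The study used MCLUST in R; this version is illustrative only, with hypothetical function names.

```python
import numpy as np

def mvn_density(x, mu, cov):
    # Multivariate normal density: (2*pi)^(-n/2) |cov|^(-1/2) exp(-d' cov^-1 d / 2)
    n, d = len(mu), x - mu
    return ((2 * np.pi) ** (-n / 2) * np.linalg.det(cov) ** -0.5
            * np.exp(-0.5 * d @ np.linalg.inv(cov) @ d))

def qda_fit(X, y):
    # Per-class mean mu_g, covariance Sigma_g, and prior pi_g from training data.
    return {g: (X[y == g].mean(axis=0),
                np.cov(X[y == g], rowvar=False),
                np.mean(y == g))
            for g in np.unique(y)}

def qda_predict(params, t):
    # The denominator of the posterior is common to all classes, so
    # argmax_g Pr(g | t) reduces to argmax_g pi_g * Pr(t | g).
    return max(params, key=lambda g: params[g][2]
               * mvn_density(t, params[g][0], params[g][1]))

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, (50, 2)), rng.normal(2.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
params = qda_fit(X, y)
```

LDA would differ only in pooling a single covariance estimate **Σ** across all classes.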

#### 1-nearest neighbor (1NN)

1NN is a simple non-parametric classification algorithm that requires no training process. Given a set of reference samples and a test sample, 1NN searches the reference dataset for the sample nearest to the test sample and assigns the test sample to the class to which that nearest sample belongs. In case there are multiple nearest reference samples, voting is used to assign the test sample to the class containing the largest number of nearest reference samples. As discussed below, mtDNA profiles are encoded into binary feature vectors. We used the number of mismatching positions (i.e., the Hamming distance) to measure the distance between samples, and did not apply PCA to the data before applying 1NN.
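A minimal sketch of this rule, including the voting step for ties among equally-near reference samples, might look as follows (names and data are illustrative, not from the study):

```python
import numpy as np
from collections import Counter

def one_nn(refs, labels, test):
    """Classify a binary test vector by its nearest reference(s) under Hamming distance."""
    dists = (refs != test).sum(axis=1)              # Hamming distance to each reference
    nearest = np.flatnonzero(dists == dists.min())  # all references tied at the minimum
    votes = Counter(labels[i] for i in nearest)     # vote among the nearest samples
    return votes.most_common(1)[0][0]

refs = np.array([[0, 0, 1, 1],
                 [0, 1, 1, 1],
                 [1, 1, 0, 0]])
labels = np.array(["A", "A", "B"])
pred = one_nn(refs, labels, np.array([0, 0, 1, 0]))  # nearest is refs[0] -> "A"
```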

### Datasets

We used the forensic and published tables in the mtDNA population database [7] to empirically evaluate the performance of the four algorithms for ethnicity assignment. The forensic table contains 4,839 samples collected and typed by the Federal Bureau of Investigation (FBI), while the published table contains 6,106 samples collected from the literature.

In this study, we focus only on the samples annotated as belonging to one of the four coarse ethnic groups – Caucasian, African, Asian and Hispanic. Filtering the forensic and published tables by these criteria results in 4,426 and 3,976 samples, respectively. In the rest of the paper we will refer to the two filtered tables simply as the *forensic* and *published datasets*. The forensic dataset contains 1,674 Caucasian (37.8%), 1,305 African (29.5%), 761 Asian (17.2%) and 686 Hispanic (15.5%) samples, while the published dataset is comprised of 2,807 Caucasian (70.6%), 254 African (6.4%) and 915 Asian (23.0%) samples.

Additional file 1 shows the percentage of samples sequenced at each position for the forensic and published datasets. We note that the forensic dataset has a significantly better coverage than the published dataset. All the samples in the forensic dataset cover portions of both hypervariable region 1 (HVR1) and hypervariable region 2 (HVR2) of mtDNA, whereas over 60% of samples in the published dataset do not cover HVR2 and around 5% of them do not cover HVR1.

To better characterize and compare the forensic and published datasets, we assign each sample in the two datasets to one of the 23 basal haplogroups defined in [8]. Haplogroup assignment was performed using the unweighted 1NN algorithm described in [8] along with the Genographic Project open resource mitochondrial DNA database (the consented database) of 21,164 samples [16]. Behar et al. [8] reported a leave-one-out cross-validation accuracy of 96.72% on a reference database of 16,609 samples. We observed a comparable accuracy of 96.51% on the consented database. Therefore, we expect the inferred haplogroups of samples in the forensic and published datasets to have a similarly high accuracy. The ethnicity composition of each haplogroup and the inferred haplogroup composition of each broad ethnic group represented in the forensic and published datasets are given in Additional file 2. Additional file 2(A) supports the well-known fact that many haplogroups are strongly associated with a specific ancestry. For example, most samples with inferred haplogroup H, J, K, R0*, T, U*, and V are Caucasian, most samples with inferred haplogroup B, D, M, N, and R9 are Asian, and most samples with inferred haplogroup L are African. However, the association is not perfect, and significant percentages of these haplogroups are present in other ethnic groups. For some haplogroups, such as B, N1*, W, and X, the association with ethnicity is particularly weak, with two or three ethnicities being represented in almost equal proportions. Additional file 2 further shows that the forensic and published datasets have significant differences in their ethnic and haplogroup compositions. Most strikingly, Caucasians are significantly over-represented and Hispanics are completely missing from the published dataset. Such differences are most likely due to the procedure used to assemble the published dataset, and reflect preferential use of samples from some ethnic groups in published studies.

For some of the experiments described in the Results section, we used specific subsets of the forensic and published datasets. The *full-length forensic dataset* consists of the 1,904 samples typed for the most extensive ranges of HVR1 (16024–16569) and HVR2 (1–576). This dataset is comprised of 222 Caucasian (11.7%), 820 African (43.1%), 415 Asian (21.8%) and 447 Hispanic (23.5%) samples. The *trimmed forensic dataset* was produced by trimming the samples in the forensic dataset such that only the region 16024–16365 in HVR1 is kept. It has the same ethnicity composition as the forensic dataset, since all samples in the forensic dataset are typed in this range. The *trimmed published dataset* was created in a similar fashion, except that only the 2,540 samples covering the 16024–16365 region were kept. This subset contains 1,956 Caucasian (77%), 134 African (5.3%) and 450 Asian (17.7%) samples.