Better primer design for metagenomics applications by increasing taxonomic distinguishability

Current methods of understanding microbiome composition and structure rely on accurately estimating the number of distinct species and their relative abundance. Most of these methods require an efficient PCR whose forward and reverse primers bind well to the same, large number of identifiable species, and produce amplicons that are unique. It is therefore not surprising that currently used universal primers designed many years ago are not as efficient and fail to bind to recently cataloged species. We propose an automated general method of designing PCR primer pairs that abide by primer design rules and uses current sequence database as input. Since the method is automated, primers can be designed for targeted microbial species or updated as species are added or deleted from the database. In silico experiments and laboratory experiments confirm the efficacy of the newly designed primers for metagenomics applications.


Introduction
DNA extraction and PCR amplification are essential steps in a number of different applications including forensics, sequencing, metagenomic analyses (i.e., study of community profiles) and comparison of ecosystems using their microbial profiles [1]. Design of primer pairs used in PCR depends on the target application and can be specific to a particular species, gene, ortholog group, taxon or community. In projects that aim to discover community composition and structure, primers are mainly required to (1) be universal in their ability to bind to a maximum number of different target species and (2) produce maximally distinguishable amplicons for the taxa of interest, which is the basis for the post PCR data analyses and species identification.
Launched in 2008, the Human Microbiome Project (HMP) [2] is the project du jour for studying microbial community composition and structure in different environmental niches within the human body. HMP primarily aims to understand the differences in the microbial composition between unhealthy and healthy states of five body sites: oral, skin, vaginal, gut and nasal/lung. HMP studies often utilize universal primers that amplify regions of the 16S rRNA, a popular target gene for studying taxonomic and evolutionary relationship between microbial organisms [3,4]. However, universal primers were developed over 20 years ago [5][6][7]. Since then many new bacterial and archaeal species have been discovered [8], so much so that a number of 16S rRNA databases have been created and updated many times. (RDP [9] has been updated 31 times in the last 5 years and has roughly quadrupled in size; Silva [10] with 16 full releases in 6 years, has grown six-fold with 449 new species in the last release.) Inevitably the universal primers have become too specific/biased, i.e., they fail to extract newly added species from the current database. Necessitated by particular projects, a number of taxon-specific primer pairs [11,12] have been developed and efforts have been made to improve complementarity of the majority of universal primers [13][14][15].
In this paper, primer design is cast as a constrained optimization problem, in which the pairs of primers must satisfy a range of criteria for being functional, while maximizing the distinguishability with respect to a given set of microbial organisms. Experimental results show the efficacy of the proposed method.

Background and motivation
In PCR, forward and reverse primers hybridize to specific locations on the target DNA sequence and the fragment between the primer binding sites is amplified. Efficient and effective PCR requires that the primers satisfy a set of widely accepted interdependent conditions (For example, see http://www.premierbiosoft.com/tech_notes/ PCR_Primer_Design.html): (1) melting temperatu-re (T m ) range, (2) primer specificity, (3) GC clamp, (4) primer-primer interactions, (5) bounded degeneracy, and (6) amplicon length. When primers hybridize to multiple locations, the efficiency of PCR is reduced. PCR efficiency depends greatly on the strength of hydrogen bonds formed between the primer and the template. The distinction between a strong base (C or G) and a weak base (A or T) is based on the number of hydrogen bonds (3 and 2, respectively) that a base requires to bind to the reverse strand. Primers with greater GC content result in stronger anchoring and greater energy is needed to break the bonds. Given that the PCR elongation happens at the 3' end, it is important that 3' end anchors well and better (more strongly) than the 5' end. Still, it is recommended that the 3' end have at most two strong bases. PCR works efficiently only within the temperature range defined by the primers it uses. Thus, the temperature ranges defined by the forward and reverse primers must overlap. At higher temperatures primer-template binding is more specific and it is harder for primers to form secondary structures. However, the primer-template bonds that are formed are broken more easily, which reduces PCR efficiency. Hence the need for high GC content. At low temperatures, although the primer-template bonds are not easily broken, primers hybridize more easily. Furthermore, efficiency of PCR is greatly diminished by allowing forward or reverse primer to either bind to itself (selfDimer) or to each other (primerDimer). It is particularly important that the 3' ends of forward and reverse primers do not create primer dimers. This is implicitly taken care of by the rule of not having more than two strong bases at the 3' end. High primer degeneracy could also be another factor contributing to decreased primer efficiency. A degenerate primer is a mixture of non-degenerate primers. Hence, the increase in primer degeneracy is proportional to decrease in concentration of the individual primers.
Our studies show that most of the universal primers and their derivatives in current use do not abide by all of these primer design rules. In Table 1, the primers whose IDs start with U are the universal primers. Violations of the rules governing optimal primer design are underlined and in bold font. The column labeled nHits indicates the number of RDP database sequences that the primer is likely to anneal to. For PCR to produce enough amplicon for sequencing or for metagenomic analysis this number must be as high as possible. The two best universal primers in the list, U518R and U337F, do not have a compatible reverse primer. The first one is always paired with U8F to amplify the V13 region (e.g., HMPV13 primer pair). However, U8F has a very low number of hits from RDP. That is because the sequenced data in the RDP database often does not provide sequence data in that region. The U337F primer does not have compatible reverse primer, since its temperature range does not overlap with any of the universal reverse primers (U805R, U907R, U1492R).
HMP commissioned a study that produced a set of recommended primer pairs, their experimental settings and post experimental data processing and analysis workflow [17]. The resulting primers, shown in Table 1, whose ID starts with HMP, are variants of universal primers, with degeneracy increased so as to increase the number of hits in the database, which in turn allows for more precise sequenced read classification and hence better estimation of sample diversity and richness, both of which are fundamental measures used in metagenomic studies.
The recent ubiquity and affordability of HMP-supported studies is due to the advent of Next Generation Sequencing (NGS) technologies that produce large volumes of sequenced reads cheaply. (Note that even though NGS makes it possible to cheaply sequence the entire genome instead of just the 16S region, it may not much help distinguishability of the microbes in the community because of the proliferation of highly conserved genes and the presence of horizontally transferred genes between members of the community.) NGS costs are affordable, but they are not low enough to allow for an experiment to be repeated a number of times. Thus, in silico studies have been conducted to assess the diversity of the amplicons using universal and HMP primers [4,[18][19][20]. Although primers such as UV34, HMPV35 and UV6 are predicted to result in high diversity of the amplicons, they are flawed and can be improved.
We see two ways of improving metagenomic studies. First, we propose an automated, general method of designing new primers based on the current content of a target database ( Table 2) that abide by primer design rules (Table 3). Since the method is automated, it will be easy to update the primers as new species are added to the database or a different target set is selected from the database. Second, we observe that different species are distinguishable by different variable regions. Given, the affordability of NGS sequencing, and the difficulty in designing primers that are both, universal and abide by primer design rules, we propose the use of multiple sets of primer pairs. These sets are automatically chosen so as to maximize the number of distinguishable species. In this paper we present a software package that achieves both objectives.

Algorithm
We start with some required notation. Let S denote a set (database) of sequences over the alphabet Σ = {A, C, G, T, R, Y, S, W, K, M, B, D, H, V, N}, which corresponds to the letters used in the IUPAC code (http://www.bioinformatics.org/sms/iupac.html). Each letter in Σ corresponds to a set of bases from the DNA alphabet {A, C, G, T}. For example, the letter H corresponds to the set {A, C, T}. Each letter of has a degeneracy value, which equals the cardinality of its base set. We denote this by DEG(a), for a ∈ . The degeneracy of a sequence S is defined as the product of    then DEG(S 1 ) = 1, DEG(S 2 ) = 32 and DEG(S 3 ) = 6. A sequence is called non-degenerate if its degeneracy equals 1. Degeneracy of a sequence is therefore a measure of the number of different non-degenerate sequences that it represents (or matches). Match. A letter, p, from a primer sequence is said to MATCH a letter, t, from a target sequence, if the set of bases corresponding to t is a subset of the set of the bases corresponding to p. Thus, for example, if the target base is R and the primer base is D then MATCH(D, R) = true, while match(R, D) = false. On the other hand, MATCH(G, H) = MATCH(H, G) = false. Extending this definition to sequences, a primer sequence P of length j is said to match a target sequence T, if there exists a location l, such that the letters of P match the corresponding letters of T (j, l). For example, for the sequences mentioned above, MATCH(S 2 , S 3 ) = false, while MATCH(S 2 , S 1 ) = true since S 2 matches the subsequence of S 1 of length 6 starting from location 2. Biologically, if a primer sequence matches a target sequence, then the primer sequence would hybridize to the target sequence at the location of the match.
Degenerate codes such as S = {C, G}, W = {A, T} and N = {A, C, G, T} mean different things in different contexts. A degenerate code N in the database (e.g. RDP) means that the value of that base is not precisely determined. Thus, an N suggests that the base is A or C or G or T and any choice of the base (other than N) for the primer at that position could fail to match. On the other hand, a degenerate code in the primer means that all possible bases are present in the primer at that position. Thus, an N suggests that the base is A and C and G and T, i.e., different copies of the primer with different bases in that position are used. Consequently, it would match any letter in the corresponding target location.
Common. Two letters from are said to be in COMMON if the sets of bases they represent have a non-empty intersection. Thus, for example, COMMON (R, D) = TRUE, while COMMON(G, H) = false. The definition can be extended to two sequences being in common as follows: Two sequences are said to be in COMMON if every letter of one of the sequences is in common with the corresponding letter from the other sequence.
Design parameters for optimal degenerate primers. As mentioned earlier, the PCR reaction requires a number of conditions to be satisfied by the primers in order to work well. The following parameters and their constraints have been compiled from various design manuals. (For example, see http://www.premierbiosoft.com/ tech_notes/PCR_Primer_Design.html.) We will refer to them henceforth as the Optimal Primer Design Rules.
(1) SELFDIMER(P) is defined as the length of the longest contiguous COMMON subsequence between the primer P and its reverse complement. Good primer design requires that this quantity be minimized.
(2) PRIMERDIMER(F, R) is defined as the length of the longest contiguous COMMON subsequence between the forward (F ) and the reverse (R) primers. This quantity should be minimized.
(3) AMPLICONLENGTH(F, R) is the length of sequence that starts with F and finishes with R. The amplicon length is defined in the range of 150-600 bp. The lower limit prevents amplification of conserved regions. The upper limit is determined by the current read length that sequencing technologies support.
(4) RUNLENGTH(P ) is the maximum number of consecutive occurrences of the same base from the set {A, C, G, T}. Thus, for the U336R primer in Table 1, the red colored subsequence has RUNLENGTH(U336R) = 6. . Good primer design requires that the entire range lies between 55°and 65°. It also requires that RANGETM(P) = MAXTM(P)-MINTM(P) not be more than 5°. Again, in practice, a slight deviation from these conditions is acceptable.
Instead of allowing an arbitrary sequence to be a candidate primer, we define some more terms that help us narrow down the search space for our candidate primers.
Viable Primer. A viable primer is a sequence that satisfies the optimal primer design rules from above.
Common Viable Primer for H . A common viable primer for a multiset of sequences H is a viable primer if it matches at least minSupport number of sequences in H .
Derivative. A sequence P is derived from a sequence T if MATCH(P, T) = true. (For example, P 1 = AWGCT is a derivative of T = ATGCT, but P 2 = ATGGT is not.) Amplicon from sequence S with Primer Pair (F, R). An amplicon A from sequence S is a contiguous subsequence of S defined by a viable primer pair (F, R). A must start with a subsequence that matches the reverse complement of the forward primer F and must end with a subsequence that matches the reverse primer R.
Reference Target Sequence T. A sequence T from a database of sequences S is said to be a reference target sequence if the forward and reference primers are derived from T.

Problem statement
The primary task of a viable primer is to match as many sequences as possible from a target database. For primer pairs, it is important to maximize the number of sequences that match both primers. In applications such as metagenomics, where the amplicons are sequenced and then used to identify the organism from which it originated, a third requirement is that amplicons generated by the primers be uniquely distinguishable.
The Degenerate Primer Design Problem. For a given set of parameters P , a set of sequences S and a reference target sequence T ∈ S , find a viable primer pair (F, R) that matches T for which the number of distinguishable amplicons generated by its primers (F, R) for the sequences in S is maximized.
In other words, given a database S (e.g., RDP), a reference sequence T (e.g, E. coli), and values for the design parameters, P , including MAXSELFDIMER, MAXPRI-MERDIMER, MAXRUNLENGTH, MINCG%, MAXCG%, MINTM, and MAXTM, develop a set of viable forward and reverse degenerate primer pairs with the maximum number of unique amplicons. The reference sequence T ensures that the number of possible primer sequences is finite. The observation below prunes the search space further.

Algorithm description
Observation 1 If primer P 1 is not viable then a primer P 2 derived from P 1 is not viable either.
Scanning for Candidate Forward and Reverse Primers. To make the search more efficient, we focus on primer sequences derived from the target sequence. Observation 1 is also used to prune away candidate primers that are not viable.
Algorithm SELECTPRIMERCANDIDATES shown here designs set F of viable common degenerate forward primers that correspond to all viable primers P from reference sequence T and sequence database S . A similar design is performed to generate the set of viable common reverse primers R . First, for each viable primer from T function FIND-BESTVALIDMATCH generates a set H of unique best valid matches in S . Note that not all target sequences contribute to H . Some target sequences might not have a valid match or unique best valid match for the relevant subsequence of T. Unique best valid match between a primer and a sequence. A match between a viable primer P and a sequence S i is valid if there exists a contiguous subsequence in S i of the same length as P and that has at most log 2 (MAXDEGENERACY) mismatches with P. If there is only one valid match between P and S i that has the least number of mismatches then that is the unique best valid match.
Algorithm DEVELOPDEGPRIMER develops a set of viable common primers corresponding to input multiset H . From the aligned set of hits H , we create a frequency count matrix for each position. If the highest frequency in a position (column) in the alignment is smaller than minSupport, then that position in the primer would require to be degenerate. We first check that the number of positions that require degeneracy is no greater than log 2 (MAXDEGENERACY). If it is, then the bound on degeneracy would be violated and we exit without computing a common degenerate primer. Otherwise, we construct the initial degenerate primer, degPrimer, with each of its bases set to the base that occurs most frequently in that position. If the initial degenerate primer is not a viable primer we exit since its derivates need not be considered by Observation 1. Starting from the initial primer, we remove all entries of H that match the current primer. To increase the number of hits in H , we use a greedy method to iteratively increase primer degeneracy, one base at a time, as follows. In each iteration, from the remaining sequences in H , we reconstruct the frequency count matrix, storing all positions that do not yet match the current primer, ordered by the frequency count of the highest frequency. We choose to increase degeneracy of a base with the highest frequency count, while making sure that the new primer remains viable. Ties are broken in favor of the base for which the increased degeneracy causes the least increase in overall degeneracy of the sequence. When a new common viable primer is found, it is added to the candidate primer set F .
Algorithm DESIGNPRIMERPAIR takes a set of viable forward primers and a set of viable reverse primers, groups each set into subregions, sorting each group by their number of hits in the database. To allow for better primer pairing with regard to distinguishabilty, each conserved region is further subdivided into subregions, defined by the primer start location. The sorted lists of forward and reverse primers are scanned to select the pair with maximum number of simultaneous hits in the database that abide by optimal design rules. Once we have primer pairs and their number of hits in the database, we focus on amplicon distinguishability. In the context of degenerate primer design, an amplicon is distinguishable if and only if it has a match to exactly one sequence in the database. In order to select sets of candidate primer pairs, we extend the notion of amplicon distinguishability to two or more primer pairs. A sequence is distinguishable if it produces a distinguishable amplicon with at least one of the primer pairs in the chosen set.
Function FINDDISTINGUISHTOTAL creates a taxon matrix to easily gather and compare results for each primer pair and across two (or more) primer pairs. Data can be generated for any combination of two primer pairs. It is possible to request only a number of best combination for a desired metric (distinguishable/total number of hits per genus/species).

In silico design of degenerate primers
We run the algorithm on sequences from the 16S rRNA RDP database (downloaded on February 12, 2012). The set is described in Table 2. We used the E. coli sequence as the template sequence and the set of parameters used for primer design and primer pair design are described in Table 3. The algorithm described above found 208 (244, resp.) viable common forward (reverse, resp.) primers from 8 forward and (10 reverse, resp.) regions. We note that there is no viable common primer in the conserved regions between variable regions one and two. Note that our procedure was able to identify the widely used universal primers, U337F and U518R, which abide by the optimal design rules. The second primer in Table 4 in region FV3 is a more degenerate version of U337F, and the second primer under the region RV3 is a degenerate version of U518R. Note that our algorithm found a forward primer better than U337F in the FV3 region with more hits and and lower SELFDIMER value. Similarly, a better reverse primer was found in the RV3 region with almost 400 more matches in RDP. Table 4 represents only a sample of the viable common primers found by the algorithm and represent the best primers in terms of the number of RDP hits. Table 4 shows a number of FV7 primers to illustrate that within a region two primers can have significantly different parameters, but still are valid options in the PRIMERPAIRDESIGN. In Table 5 we include only a selection of designed primer pairs, together with select primer pairs from HMP and Universal primer pairs for comparison. Comparing primer pairs spanning roughly the same regions, we note that primer pair F528-R1052 is better than HMP_V3-5 by most measures. Pair F345-R505 improves upon U337F-518R, and pair F345-R775/873 compares favorably to U337F-805R. Pairs F1064-R1367 or F1185-R1367 are by most measures better than HMPV69. Primer F1064 is a better substitute for U1100F primer, used as HMPV6F. The table also shows that different primer pairs can be chosen depending on what features are important to a user. For example, for region V8, primer pair F1185-R1367 would be chosen if phylum distinguishability is considered important, but not if genus or species distinguishability is considered important. Furthermore, the results also indicate that total number of hits per taxon and the corresponding distinguishability do not necessarily correlate (compare F528-R1052 and F797-R1052). The same can be said for the distinguishability between genus and species (compare F345-R922 and F528-R775).
To see whether and how amplicon length is correlated taxon distinguishability we ran the degenerate primer design algorithm with no upper bound for amplicon length. One may hypothesize that an increase in the number of variable regions in the amplicons generated by a primer pair will lead to an increase in taxon distinguishability. The best result was obtained with the first primer pair in Table 5 (F345-R1052). It flanks V3-V6 regions. Hence increase in amplicon length does not necessarily increase taxon distinguishability. We postulate that conserved regions of different species are not uniformly conserved. This brings us to the final step: combination of different primer pairs that extract different taxons to  increase taxon distinguishability. We include HMP and universal primers from Table 5 in the input data set. The results are presented in Table 6 below. In Table 6 the first two pairs of primer pairs are the best combination for distinguishing the most number of species and most number of genera, respectively, if amplicon length is not bounded.
If we allow amplicons of arbitrary lengths then designed primers can generate and distinguish all the Genera and can generate all the Species, while failing to distinguish 563. As the lengths of the sequenced reads are increased, the number of taxons not generated or distinguished by two primer pairs gets smaller. In particular, note that if we do not bound amplicon length, only 12 genera failed to be distinguished by a combination of V36-F348-R1050 and V47-F528-R1164, neither one of which is the best primer pair (V36-F345-R1052).

Laboratory experiments with designed primers
The best designed primers were tested in the laboratory. Needless to say, the performance of the primers will depend on the microbial composition of the samples on which these primers are applied. We tested these on DNA extracted from lung biolavage samples from the lungs of 8 (unidentified) individuals. The experiments were performed in the Mathee laboratory. PCR was performed on the extracted DNA and gel electrophoresis was used to isolate the amplicons of the appropriate length. The results were compared using BioAnalyzer (http://www.genomics.agilent.com/). Figure 1 shows gel images for a sample from a single patient with 11 different primers -3 HMP primers and 8 new primers from this paper. Under identical cnditions, the figure shows that 5 out of the 8 newly designed primers and 2 of the 3 HMP primers produced sufficient and comparable amplification. We then picked the best performing HMP primer and a MJ primer and tested them with 8 different patient samples. The read length distribution from the BioAnalyzer results in Figure 2 shows that the MJ primer produced more amplification than the HMP primer for most samples. More detailed analyses (not reported here) show that new genera (Alterococcus, Coxiella, Isosphaera, Leptolinea, Rubritalea, and Zavarzinella) and phyla (Armatimonadetes and Lentisphaerae) were detected by the new primers.

Discussion
The design paramater values shown in Table 3 are default values as suggested in the literature by PCR practitioners. However, the algorithm allows the user to change these values as they see fit. 16S rRNA variable regions are not uniformly variable across all species, i.e., for a group of species a particular variable region might be more conserved than others. Clearly, this differential variability of the regions affects their impact on the choice of primers and the ability of the primers to generate distinguishable amplicons.
Although intuitively it might seem that the greater the span of hypervariable regions of the amplicons from a primer pair, the greater would be its distinguishability, in reality it also depends on the specific taxa matched by the pair (F, R). A pair (F 1 , R 1 ) with fewer hits than pair (F 2 , R 2 ), and whose amplicon is contained in the amplicon for (F 2 , R 2 ) is likely to distinguish more taxa. Different 16S rRNA regions have different degrees of variability. However, it has been noted that the set of taxa distinguished by two different regions may vary substantially, thus providing a basis for combining primer pairs. The primer pairs that result in the most number of distinguishable taxa is not necessarily the result of combining primer pairs with the best individual results. To summarize, in order to obtain as complete a picture of a microbiome as possible the number of uniquely identifiable extracted species must be maximized. In other words what is needed are compatible forward and reverse primers that bind well to the same, large number of identifiable species, and produce amplicons that are species specific. Taxon distinguishability depends on (1) how many hits a forward/reverse primer has (2) how many hits a primer pair has (3) how variable is the region a primer pair flanks (4) how many primer pairs distinguish different taxa and (5) how many additional taxa can be distinguished by combining two (or more) primer pairs.
In comclusion, we have developed an algorithm that takes as input a set of user-defined values for the primer design parameters and the sequence database, and outputs customized primer pairs that attempt to maximize their ability to produce distinguishable amplicons for the given database. In silico experiments prove that the newly designed primers are better than the primers currently used by practitioners. Laboratory experiments confirm the efficacy of the newly designed primers. This tool will positively impact the design of primers for the Human Microbiome Project and other related metagenomics projects.

Dedication
Sadly we announce the untimely passing of Melita Jaric and dedicate this paper to her memory.