Volume 6 Supplement 7
Proceedings of the Great Lakes Bioinformatics Conference 2012
Effect of conformation sampling strategies in genetic algorithm for multiple protein docking
 Juan EsquivelRodríguez^{1} and
 Daisuke Kihara^{2, 1}Email author
DOI: 10.1186/175365616S7S4
© EsquivelRodríguez and Kihara; licensee BioMed Central Ltd. 2012
Published: 13 November 2012
Abstract
Background
Macromolecular protein complexes play important roles in a cell and their tertiary structure can help understand key biological processes of their functions. Multiple protein docking is a valuable computational tool for providing structure information of multimeric protein complexes. In a previous study we developed and implemented an algorithm for this purpose, named MultiLZerD. This method represents a conformation of a multimeric protein complex as a graph, where nodes denote subunits and each edge connecting nodes denotes a pairwise docking conformation of the two subunits. MultiLZerD employs a genetic algorithm to sample different topologies of the graph and pairwise transformations between subunits, seeking for the conformation of the optimal (lowest) energy. In this study we explore different configurations of the genetic algorithm, namely, the population size, whether to include a crossover operation, as well as the threshold for structural clustering, to find the optimal experimental setup.
Methods
MultiLZerD was executed to predict the structures of three multimeric protein complexes, using different population sizes, clustering thresholds, and configurations of mutation and crossover. We analyzed the impact of varying these parameters on the computational time and the prediction accuracy.
Results and conclusions
Given that computational resources is a key for handling complexes with a large number of subunits and also for computing a large number of protein complexes in a genomescale study, finding a proper setting for sampling the conformation space is of the utmost importance. Our results show that an excessive sampling of the conformational space by increasing the population size or by introducing the crossover operation is not necessary for improving accuracy for predicting structures of small complexes. The clustering is effective in reducing redundant pairwise predictions, which leads to successful identification of nearnative conformations.
Background
The tertiary structure of proteins provides valuable information about the mechanisms of protein function, however, structures of multimeric protein complexes are often difficult to solve by experimental methods. Even in the cases that the structure of an entire complex has not been determined, the structure of the individual subunits are often available, either because they have been solved experimentally or computationally modeled. MultiLZerD is a multiple protein docking protocol developed by our group [1–3], which takes structures of individual subunits and assembles them into complex models. The method was shown to be able to construct nearnative structures both for bound and unbound docking cases. It was shown MultiLZerD achieved overall better performance than a competitive method especially in unbound docking of multiple subunits [3].
As it is typical for stochastic evolutionary optimization approaches, the parameters used to execute the simulations affect the accuracy of the final results as well as the computational cost. In particular, we would like to optimize the number of random graphs used as the initial input and those kept at each GA generation (the population size) as well as the population variability given by the mutation and the crossover operators. Using these two operators increases the population size temporarily, before selecting the best fitted structures that are passed to the subsequent generation. The computation of a physicsoriented fitness function can have a significant cost if the population size is too large. Oppositely, if the population size is too small, it may require a large number of generations to find a nearnative model. We also tested different pairwise docking clustering thresholds. The pairwise docking decoys for each pair of subunits are clustered based on the root mean square deviation (RMSD) of Cα atoms. The procedure implemented is based on the clustering ideas conceived in ClusPro [5]. Two complexes are considered to be neighbors if they are closer than a threshold value. Once the clusters are created, the decoy with the best shapebased score is selected out of each cluster as a representative structure. The other members in the clusters will be deleted.
In this work, we analyzed the effect of the three parameters in the GA optimization in MultiLZerD, namely, the population size, whether to use the crossover operation, and the threshold value in the structural clustering (Figure 1). The results suggest that an excessive sampling of the conformational space is not necessary in our multipledocking procedure to find correct structure models.
Methods
We assessed the effect of different GA parameters used in MultiLZerD on the computational time and the accuracy. The original MultiLZerD is configured as follows: Given a set of tertiary structures of subunits of a protein complex, the pairwise protein docking algorithm, LZerD [4], is run to produce 54,000 docking candidates (decoys). A conformation of a whole protein complex is uniquely defined by a spanning tree (graph) where each node is connected to at least another node by an edge. This is suitable for constructing a multiple docking complex from pairwise decoys because not all pairs of nodes need to be connected. An edge between a pair of subunits specifies one of 54,000 pairwise docking decoys. Starting with M spanning trees, with M being 200 in the original setting, various alternative conformations are generated by GA with mutation or crossover operations. The mutation deletes one of the spanning tree edges and then selects a new edge randomly to reconnect the graph. It is possible that the same edge is selected again. Then, for a newly selected edge, one of the pairwise docking decoys for the two subunits is randomly selected. The rest of the edges remain unaltered. The crossover takes two candidate structures in the current population and creates a new individual by combining edges from the two parents. It will first create an empty graph and randomly select edges from the parents until a spanning tree is created. The decoys are subject to clustering with a predefined threshold value of RMSD. Finally, M = 200 decoys with the best fitness scores are selected for the next generation. In the final generation, clustering will yield at most M = 200 decoys ranked by their fitness score as the final prediction.
Two experiments were performed (Figure 1). The first experiment was to examine the effect of using the crossover operation together with the mutation operator in the GA. We executed MultiLZerD in two different settings:
1) Enable both mutation and crossover operations. Decoys of a population size of M (200) were subject to the application of the crossover operation, which was set to increase the population by 50% (i.e. 300). Then, the mutation operation is applied, which was configured to double the population size (i.e. 300 × 2 = 600). The 600 decoys were clustered and ranked by the fitness score, and the top 200 decoys were passed to the next iteration.
2) Enable mutation and disable crossover. For a population of 200 decoys, the mutation operation increases it to 400, which were subject to the clustering and the selection by the fitness score.
In both 1) and 2), the withingeneration clustering was performed using a threshold value of 10 Å.
The second experiment is to examine the effect of different population sizes, 200, 400, 600, and 800 at each GA generation. Obviously, increasing the population size is costly but at the same time near native models may be found at an earlier GA generation. In this experiment, only the mutation operation was used and the threshold of the withingeneration clustering was set to 10 Å.
We have also examined different clustering cutoffs for pairwise docking decoys computed by LZerD. Three pairwise clustering settings were tested, using either 5 Å or 10 Å Cα RMSD, or without using clustering (Figure 1). For the last setting, all 54,000 pairwise predictions generated by LZerD were used. The three clustering settings are tested in combination with the four different population sizes.
Results
Impact of using the crossover operator
In the case of 1ES7, the GA run with crossover generated a near native structure at an earlier generation of around 600. In contrast, the GA without crossover found a near native structure at around the 900^{th} generation. Although the GA with crossover found the nearnative structure earlier, the actual computational cost was similar because using the crossover increases the population size at each generation by 1.5 times, which increases the clustering cost by 1.5^{2} = 2.25 times, as we will discuss in the computational cost section. In the next case (Figure 3B), the GA run without crossover quickly found the structures between 68 Å RMSD, starting with 7.53 Å at the 308th generation. However, the GA with crossover found a better structure at the end around the 3000^{th} generation. In the last example (Figure 3C), the GA with crossover identified a near native structure at a significantly earlier generation (136th) than without crossover, which yielded a nearnative structure at the 2689th.
Overall, the GA with crossover found a nearnative structure at an earlier generation for two out of three cases. However, in one case (1ES7, Figure 3A) the actual computational cost for the GA with and without crossover was comparable.
Impact of the population size and the clustering threshold
Summary of predictions using different population sizes and clustering thresholds
No Pairwise Clusters  5Å Pairwise Clusters  10Å Pairwise Clusters  

PDB  Population Size  RMSD (Å)^{1}  fnat^{2}  Rank  RMSD (Å)^{1}  fnat^{2}  Rank  RMSD (Å)^{1}  fnat^{2}  Rank 
1ES7  200  17.51  0.03  20  17.94  0.03  16  1.86  0.92  3 
400  9.85  0.25  159  7.95  0.63  47  1.86  0.92  3  
600  14.21  0.00  445  9.56  0.34  265  1.86  0.92  3  
800  2.15  0.70  18  7.23  0.64  2  1.86  0.92  3  
1K2X  200  18.52  0.00  84  21.20  0.01  61  6.05  0.49  74 
400  20.36  0.01  239  6.05  0.49  80  23.69  0.01  195  
600  6.14  0.46  250  2.12  0.51  1  23.41  0.01  599  
800  19.45  0.01  56  19.26  0.02  718  6.04  0.49  60  
1WWW  200  19.16  0.02  51  13.49  0.11  152  8.62  0.58  23 
400  5.34  0.52  10  15.58  0.00  115  2.48  0.78  12  
600  17.37  0.03  248  15.76  0.03  182  11.12  0.39  13  
800  16.30  0.00  160  17.19  0.02  615  2.48  0.78  14 
The first observation is that results using the 10 Å cutoff showed the best RMSD models in ten out of twelve cases as compared with results without clustering and with the 5 Å cutoff. Comparing the no clustering and the 5 Å cutoff, the latter performed better than the former for seven out of twelve cases. The results indicate a better result can be expected in general when a larger cutoff is used for clustering for reducing redundancy to efficiently explore the conformational space.
As for the population size, we did not observe a clear trend relative to the best RMSD. Thus, we temporarily conclude that the population size of 200 or 400 is sufficient for the multiple docking by MultiLZerD for the complexes of four chains. Although we did not observe improvement of prediction accuracy by increasing the population size for the current dataset, a larger population size may work better for complexes of a larger number of chains.
Computational cost
The computational time of the GA optimization in MultiLZerD consists mainly of the pairwise docking by LZerD along with the pairwise clustering step, and the followings at each GA generation: the calculation of the fitness score for each decoy, the mutation operation for each decoy, the crossover operation for pairs of decoys, and the clustering step performed at each generation. The time complexity of computing the fitness score, the mutation operation, and the crossover operation is linear in the number of decoys, while it is quadratic for the clustering step because it needs pairwise RMSD values between all decoy pairs in the population. Thus, the overall time complexity is quadratic in the number of decoys in the population. The use of the crossover operation has a significant impact in the overall computational time because it will increase the number of decoys in the population that are subject to the clustering.
Conclusions
MultiLZerD employs GA for exploring the conformation space and there are several key parameters that can critically affect the prediction performance. Since computing the fitness score is a timeconsuming step, it is desired to keep the population of decoys in each GA generation small, but at the same time the algorithm should be able to explore the conformation space sufficiently to find nearnative models. From the testing of MultiLZerD on the three protein complexes, we found that the population size of 200 or 400 is sufficient (Table 1). The clustering is effective in reducing redundant pairwise predictions, which leads to successful identification of nearnative conformations. Using the crossover operation yielded a nearnative structure in an earlier generation than without using the crossover; however, they yielded similar final prediction results after a larger number of iterations. The current study leads to the conclusion that an excessive sampling of the conformational space is not necessary for small protein complexes (around four subunits) to find correct structure models in the MultiLZerD scheme. However, it is not clear if the conclusion applies to complexes with a larger number of subunits since their conformational space can be significantly larger than smaller ones.
List of abbreviations used
 • GA:

Genetic Algorithm
 • fnat:

Fraction of native contacts
 • PDB:

Protein Data Bank
 • RMSD:

Root mean square deviation.
Declarations
Acknowledgements
This work was supported by the National Institute of General Medical Sciences of the National Institutes of Health (R01GM075004, R01GM097528) and the National Science Foundation (EF0850009, IIS0915801, IOS1127027) and National Research Foundation of Korea Grant (NRF2011220C00004). JER is a Fulbright Science and Technology Fellow.
This article has been published as part of BMC Proceedings Volume 6 Supplement 7, 2012: Proceedings from the Great Lakes Bioinformatics Conference 2012. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcproc/supplements/6/S7.
Authors’ Affiliations
References
 EsquivelRodriguez J, Kihara D: Evaluation of multiple protein docking structures using correctly predicted pairwise subunits. BMC Bioinformatics. 2012, 13: S6PubMed CentralView ArticlePubMedGoogle Scholar
 EsquivelRodriguez J, Kihara D: Fitting multimeric protein complexes into electron microscopy maps using 3D Zernike descriptors. J Phys Chem B. 2012, 116: 68546861. 10.1021/jp212612t.PubMed CentralView ArticlePubMedGoogle Scholar
 EsquivelRodriguez J, Yang YD, Kihara D: MultiLZerD: multiple protein docking for asymmetric complexes. Proteins. 2012, 80: 18181833.PubMed CentralPubMedGoogle Scholar
 Venkatraman V, Yang YD, Sael L, Kihara D: Proteinprotein docking using regionbased 3D Zernike descriptors. BMC Bioinformatics. 2009, 10: 40710.1186/1471210510407.PubMed CentralView ArticlePubMedGoogle Scholar
 Comeau SR, Gatchell DW, Vajda S, Camacho CJ: ClusPro: an automated docking and discrimination method for the prediction of protein complexes. Bioinformatics. 2004, 20: 4550. 10.1093/bioinformatics/btg371.View ArticlePubMedGoogle Scholar
 Janin J: Proteinprotein docking tested in blind predictions: the CAPRI experiment. Mol Biosyst. 2010, 6: 23512362. 10.1039/c005060c.View ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.