Enhancing genome assemblies by integrating non-sequence based data
© Heider et al; licensee BioMed Central Ltd. 2011
Published: 28 April 2011
Many genome projects were underway before the advent of high-throughput sequencing and have thus been supported by a wealth of genome information from other technologies. Such information frequently takes the form of linkage and physical maps, both of which can provide a substantial amount of data useful in de novo sequencing projects. Furthermore, the recent abundance of genome resources enables the use of conserved synteny maps identified in related species to further enhance genome assemblies.
The tammar wallaby (Macropus eugenii) is a model marsupial mammal with a low coverage genome. However, we have access to extensive comparative maps containing over 14,000 markers constructed through the physical mapping of conserved loci, chromosome painting and comprehensive linkage maps. Using a custom Bioperl pipeline, information from the maps was aligned to assembled tammar wallaby contigs using BLAT. This data was used to construct pseudo paired-end libraries with intervals ranging from 5-10 MB. We then used Bambus (a program designed to scaffold eukaryotic genomes by ordering and orienting contigs through the use of paired-end data) to scaffold our libraries. To determine how map data compares to sequence based approaches to enhance assemblies, we repeated the experiment using a 0.5× coverage of unique reads from 4 KB and 8 KB Illumina paired-end libraries. Finally, we combined both the sequence and non-sequence-based data to determine how a combined approach could further enhance the quality of the low coverage de novo reconstruction of the tammar wallaby genome.
Using the map data alone, we were able to order 2.2% of the initial contigs into scaffolds and increase the N50 scaffold size to 39 KB (36 KB in the original assembly). Using only the 0.5× paired-end sequence based data, 53% of the initial contigs were assigned to scaffolds. Combining both data sets resulted in a further 2% increase in the number of initial contigs integrated into a scaffold (55% total) but a 35% increase in N50 scaffold size over the use of sequence-based data alone.
We provide a relatively simple pipeline utilizing existing bioinformatics tools to integrate map data into a genome assembly which is available at http://www.mcb.uconn.edu/fac.php?name=paska. While the map data only contributed minimally to assigning the initial contigs to scaffolds in the new assembly, it greatly increased the N50 size. This process added structure to our low coverage assembly, greatly increasing its utility in further analyses.
The arrangement of the tammar wallaby genome is unique, with the entire 2.7 Gb genome organized into seven pairs of autosomes and an X and Y [11, 12]. However, the X is relatively small in the tammar and the Y is tiny. It has been proposed that the tammar X represents the ancestral therian X chromosome, which has undergone several additions in the eutherian lineage. Likewise, the Y in marsupials is thought to represent a minimal mammalian Y [13, 14]. The centromeres of the tammar chromosomes are also quite different from those of their eutherian relatives. Similar to centromeres in rice [15], the centromeres of the tammar are small, encompassing ~420 kb, and are composed of a heterogeneous repeat structure of interspersed satellites and centromeric retroelements [16]. In addition, the alternative reproductive strategy of the tammar has placed different evolutionary pressures on the genome. Most notably, genomic imprinting, an epigenetic phenomenon in eutherian mammals thought, in part, to regulate fetal growth and nutrition in utero, affects fewer genes and is less complex in marsupials [17, 18]. All of these features combined make the tammar wallaby genome a particularly interesting resource from an evolutionary as well as a developmental point of view.
A white paper to sequence the genome of the tammar wallaby was funded in 2004, in a joint venture between the National Human Genome Research Institute (National Institutes of Health) and the Australian Genome Research Facility Ltd. (http://www.genome.gov/Pages/Research/Sequencing/SeqProposals/WallabySEQ.pdf) to produce a 2× coverage genome. By the time the project was initiated, large amounts of non-sequence based data had already been collected to supplement Sanger sequence reads. Work continued with the non-sequence based data as a way to enhance the genome assembly. Subsequently, a comprehensive map of the wallaby genome was developed that integrated physical, linkage and synteny maps with 14,429 markers spanning the tammar genome [19]. This abundance of markers was greatly enhanced by synteny maps constructed between the tammar wallaby and the genome of the opossum (another marsupial species). While a large number of model species with low coverage genomes already have genomic maps, including Canis familiaris, Felis catus and Ovis aries, none has a map as extensive as that of the wallaby; the next largest map, from the dog project, includes 4,249 markers [20]. A well-developed map plays an important role in creating an accurate representation of the genome by providing a structure to which many smaller contigs can be attached and ordered to produce a whole chromosome. Thus, an extensive genomic map can help overcome many of the limitations of a low coverage genome and provide a platform for researchers to investigate repetitive regions, gain insight into chromosomal evolution and help to develop regulatory pathways that may depend on proximity for activity, analyses that are not possible without an enhanced assembly.
Integrating the nonsequence-based data
Table 1. Summary statistics for scaffolding of sequence and nonsequence based data. Rows: number of paired-end reads; total scaffold span; N50 scaffold span; number of scaffolds; percentage of original contigs included in scaffolds.
Integrating the sequence-based data
In addition to the physical map, there was also a wealth of information from next generation sequencing platforms that we wanted to integrate into the tammar genome assembly. The Illumina paired-end reads were mapped against the tammar wallaby genome using the short read mapping program Bowtie [24]. Bowtie was run using default parameters with the modification that each read was allowed up to 3 mismatches. Furthermore, paired-end reads where one or both reads mapped to multiple locations were excluded from the analysis. The output from read mapping was stored in standard SAM format (http://bowtie-bio.sourceforge.net/manual.shtml#sam-bowtie-output), which indicates the contig, position and orientation for each mapped read and its mate. A Perl pipeline was then constructed to convert this file into the required Bambus input files (http://sourceforge.net/apps/mediawiki/amos/index.php?title=Bambus_Manual). Bambus was then run using only the sequence-based data and the output statistics collected (Table 1).
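The conversion step described above can be illustrated with a minimal Python sketch. This is not the Perl pipeline used in this study; it only shows how the contig, position and orientation fields of a SAM record could be reduced to contig-to-contig links. The function name `contig_links_from_sam` and the tuple layout of a link are hypothetical, and the sketch assumes that multi-mapping pairs have already been removed as described.

```python
from collections import Counter

def contig_links_from_sam(sam_lines):
    """Count read pairs whose two ends map to different contigs.

    Hypothetical helper: assumes multi-mapping pairs were filtered upstream.
    Returns a Counter keyed by (contig, strand, mate_contig, mate_strand).
    """
    links = Counter()
    for line in sam_lines:
        if line.startswith("@"):            # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])               # FLAG column
        rname, rnext = fields[2], fields[6] # RNAME and RNEXT columns
        # Use only the first read of each pair so a pair is counted once.
        if not (flag & 0x1) or not (flag & 0x40):
            continue
        # Skip pairs where the read or its mate is unmapped.
        if flag & 0x4 or flag & 0x8:
            continue
        # "=" means the mate is on the same contig, so there is no link.
        if rnext in ("=", "*") or rnext == rname:
            continue
        read_strand = "-" if flag & 0x10 else "+"
        mate_strand = "-" if flag & 0x20 else "+"
        links[(rname, read_strand, rnext, mate_strand)] += 1
    return links
```

Only pairs spanning two contigs are informative for scaffolding, which is why pairs with both ends on the same contig are discarded in this sketch.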
Combining both the nonsequence and sequence based data
Integrating the nonsequence-based data
From the 14,429 markers, 173,294 pseudo paired-end reads were generated across 6 libraries with intervals ranging from 5 MB to 10 MB. The pseudo paired-end libraries increased the N50 from 36 KB in the original assembly to 39 KB and placed 2.2% of the initial contigs in scaffolds. Based on the map-enhanced assembly alone, Bambus estimates the total scaffold span at 3.2 GB, slightly larger than the predicted genome size of 2.7 GB (A. Pask, personal communication).
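As an illustration of how map markers can be turned into pseudo paired-end links, the following Python sketch pairs markers on the same chromosome whose map distance falls near a library interval. It is a sketch under stated assumptions rather than the pipeline used here: the `Marker` record, the six nominal library sizes (5 to 10 MB in 1 MB steps) and the 0.5 MB tolerance are all hypothetical choices made for the example.

```python
from collections import namedtuple
from itertools import combinations

# Hypothetical record: a map marker placed on a contig (e.g. by alignment).
Marker = namedtuple("Marker", "name chrom pos contig")

MB = 1_000_000
# Assumed nominal intervals for six libraries spanning 5-10 MB.
LIBRARY_SIZES = [5 * MB, 6 * MB, 7 * MB, 8 * MB, 9 * MB, 10 * MB]

def pseudo_pairs(markers, tolerance=MB // 2):
    """Return (library_size, contig_a, contig_b, distance) tuples.

    Two markers form a pseudo pair when they lie on the same chromosome,
    on different contigs, and their map distance is within `tolerance`
    of one of the assumed library sizes.
    """
    by_chrom = {}
    for m in markers:
        by_chrom.setdefault(m.chrom, []).append(m)

    pairs = []
    for chrom_markers in by_chrom.values():
        chrom_markers.sort(key=lambda m: m.pos)
        for a, b in combinations(chrom_markers, 2):
            if a.contig == b.contig:
                continue                      # same contig: nothing to scaffold
            distance = b.pos - a.pos
            for size in LIBRARY_SIZES:
                if abs(distance - size) < tolerance:
                    pairs.append((size, a.contig, b.contig, distance))
                    break
    return pairs
```

Binning by interval mirrors the way real paired-end libraries are grouped by insert size, so that a scaffolder can treat each pseudo library with its own expected distance.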
Integrating the sequence-based data
Using the 0.5× paired-end Illumina sequence data, the assembly was greatly improved: Bambus was able to order and orient 53% of the initial contigs into scaffolds and increase the N50 to 78 KB. Interestingly, despite fewer reads, the 4 KB Illumina paired-end data was more successful in increasing the number of contigs in scaffolds compared to the 8 KB data (40% compared to 27%); however, the 8 KB data integration produced a larger N50 size (52 KB compared to 49 KB for the 4 KB library alone) (Table 1).
Combining both the nonsequence and sequence based data
Using both sequence and non-sequence based data, we were able to increase the number of initial contigs in scaffolds to 153,612, scaffolding 55% of the contigs from the original assembly. Adding the mapping data to the Illumina paired-end libraries increased the number of initial contigs being scaffolded by 5,191. Thus, very few of the 6,024 initial contigs that the mapping data alone was able to place in a scaffold were also placed using the Illumina paired-end libraries. Furthermore, the N50 for the assembly was greatly increased (by 35%) by the inclusion of the mapping data with the Illumina data (105 KB compared to 78 KB with Illumina data alone).
Since the N50 statistics for each of the analyses above were determined by using the total scaffold span for each analysis, they are not directly comparable. However, the total scaffold span for each analysis (with the exception of the use of the virtual map alone) is within the margin of error for our direct estimate of the genome size (2.7 GB +/- 10%; A. Pask personal communication). Furthermore, the comparatively small reduction in the estimated genome size concurrent with the inclusion of more paired end data, cannot alone account for the large increase in the N50 seen using this method.
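Because the N50 depends on the total span it is measured against, a short sketch of the calculation may help. The Python function below uses the common definition of N50 (the length of the scaffold at which the cumulative length, taken from the longest scaffold downward, first reaches half of the total span); the function name `n50` and the toy lengths are illustrative only.

```python
def n50(lengths):
    """Return the N50 of a list of scaffold (or contig) lengths.

    Scaffolds are visited from longest to shortest; the N50 is the length
    of the scaffold at which the running sum first reaches half of the
    total span. Returns 0 for an empty list.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0

# Toy example: the same large scaffolds give a different N50 once the
# total span changes through the addition of many small scaffolds.
large = [80, 70, 50, 40, 30, 20]
print(n50(large))              # 70
print(n50(large + [10] * 10))  # 50
```

The toy example shows why N50 values computed against different total spans are not directly comparable.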
Genome assemblies enhanced with non-sequence-based information (especially for low coverage genomes) provide a more workable resource for analysis and comparative genomics. Our bioinformatics pipeline provides a flexible, straightforward method of integrating non-sequence based data seamlessly into a modern genome project. The combination of, and ability to prioritize, sequence and non-sequence based data into an assembly gives this method robustness not found by simply mapping the assembled contigs to a virtual genome map. This prioritization allows the paired-end data to override the physical map when encountering small segmental inversions unique to a species or even to an individual. A small proportion (553 markers) of the virtual genome map was defined by fluorescence in situ hybridization (FISH) mapping, which can identify the location of a gene on a specific chromosome to within a few megabases. To address this precision limitation of FISH mapping, we constructed pseudo paired-end libraries starting with 5 MB intervals to avoid any possibility of misinterpreting the order of the markers. Our findings showed that while the non-sequence data only marginally helped to increase the number of initial contigs scaffolded together, it was able to greatly improve the N50 size. The small increase in initial contigs scaffolded is not surprising given that the total number of paired-end reads generated from the map points was 173,294 compared to 20,133,999 paired-end reads from the Illumina data. In total, the virtual map contributed less than 1% of the data points used in the combined assembly but added 4% of the contigs to a scaffold. Therefore the mapping data, even when utilizing a limited number of data points, can provide a useful means for increasing the N50. This is likely due to the interval between paired reads (5-10 MB), which far exceeds the current capabilities of next generation sequencing and can provide a higher order structure to the genome assembly.
In addition, the mapping data allows direct assignment of the initial contigs to the chromosomes, providing valuable information beyond that of sequence data alone and further enhancing the accuracy of the final assembly.
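Two of the figures quoted in this discussion can be checked with simple arithmetic. The short Python sketch below uses only numbers reported in the text (173,294 pseudo paired-end reads, 20,133,999 Illumina paired-end reads, and N50 values of 78 KB and 105 KB); the variable names are illustrative.

```python
# Share of all paired-end data points that came from the virtual map.
map_pairs = 173_294
illumina_pairs = 20_133_999
map_share = map_pairs / (map_pairs + illumina_pairs)

# Relative N50 gain from adding the map data to the Illumina data.
n50_illumina_kb = 78
n50_combined_kb = 105
n50_gain = (n50_combined_kb - n50_illumina_kb) / n50_illumina_kb

print(f"{map_share:.2%}")  # 0.85%
print(f"{n50_gain:.1%}")   # 34.6%
```

The map therefore supplies under 1% of the links, while the N50 rises by roughly 35% relative to the Illumina-only assembly.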
The method we describe herein provides a simple pipeline for the inclusion of non-sequence based data into a genome assembly. Integrating data from more than one source (sequence based and map based) advances the robustness and confidence of any genome assembly. Map data is able to anchor contigs to chromosomes, further improving the genome assembly. While the integration of over 14,000 map points was only able to enhance the genome assembly by 2.2% in the tammar wallaby, its inclusion with Illumina paired-end data was able to greatly increase the N50 of the genome (35% above that generated from the Illumina reads alone). Given the high cost and time commitment of constructing a physical map, we would not recommend the use of extensive FISH and linkage mapping to improve a genome assembly over generating a low coverage of paired-end data. However, as the number and diversity of genomes continue to increase in public databases, closely related genomes can be used at virtually no cost to generate extensive synteny maps. Such maps can be used in the method described here for increasing the size of scaffolds, allowing assemblies to span large stretches of repetitive DNA that paired-end libraries from the current next generation sequencing platforms are not able to cross. We suggest that, together with paired-end data, this novel method can greatly enhance the assembly of a low coverage genome project, improving its utility for further analyses.
We thank ARC Centre of Excellence for Kangaroo Genomics for access to sequence data. We gratefully acknowledge Craig Obergfell’s assistance in generating the Illumina reads and assisting in the genome assembly.
This article has been published as part of BMC Proceedings Volume 5 Supplement 2, 2011: Proceedings of the 6th International Symposium on Bioinformatics Research and Applications (ISBRA'10). The full contents of the supplement are available online at http://www.biomedcentral.com/1753-6561/5?issue=S2.
- Tyndale-Biscoe CH, Renfree MB: Reproductive physiology of marsupials. 1987, Cambridge Cambridgeshire; New York: Cambridge University Press
- Tyndale-Biscoe CH, Hearn JP, Renfree MB: Control of reproduction in macropodid marsupials. J Endocrinol. 1974, 63: 589-614. 10.1677/joe.0.0630589.
- Renfree MB: Marsupials: Alternative mammals. Nature. 1981, 293: 100-1. 10.1038/293100a0.
- Renfree MB, Pask AJ, Shaw G: Sex down under: the differentiation of sexual dimorphisms during marsupial development. Reprod Fertil Dev. 2001, 13: 679-90. 10.1071/RD01096.
- Bininda-Emonds OR, Cardillo M, Jones KE, MacPhee RD, Beck RM, Grenyer R, Price SA, Vos RA, Gittleman JL, Purvis A: The delayed rise of present-day mammals. Nature. 2007, 446: 507-12. 10.1038/nature05634.
- Frankenberg S, Pask A, Renfree MB: The evolution of class V POU domain transcription factors in vertebrates and their characterisation in a marsupial. Dev Biol. 2010, 337: 162-70. 10.1016/j.ydbio.2009.10.017.
- Pask A, Graves JA: Sex chromosomes and sex-determining genes: insights from marsupials and monotremes. EXS. 2001, 71-95.
- Pask AJ, Behringer RR, Renfree MB: Resurrection of DNA function in vivo from an extinct genome. PLoS One. 2008, 3: e2240. 10.1371/journal.pone.0002240.
- Yu H, Pask AJ, Shaw G, Renfree MB: Comparative analysis of the mammalian WNT4 promoter. BMC Genomics. 2009, 10: 416. 10.1186/1471-2164-10-416.
- Pask A, Renfree MB, Marshall Graves JA: The human sex-reversing ATRX gene has a homologue on the marsupial Y chromosome, ATRY: implications for the evolution of mammalian sex determination. Proc Natl Acad Sci U S A. 2000, 97: 13198-202. 10.1073/pnas.230424497.
- Toder R, O'Neill RJ, Wienberg J, O'Brien PC, Voullaire L, Marshall-Graves JA: Comparative chromosome painting between two marsupials: origins of an XX/XY1Y2 sex chromosome system. Mamm Genome. 1997, 8: 418-22. 10.1007/s003359900459.
- O'Neill RJ, Eldridge MD, Toder R, Ferguson-Smith MA, O'Brien PC, Graves JA: Chromosome evolution in kangaroos (Marsupialia: Macropodidae): cross species chromosome painting between the tammar wallaby and rock wallaby spp. with the 2n = 22 ancestral macropodid karyotype. Genome. 1999, 42: 525-30.
- Graves JA: Sex chromosome specialization and degeneration in mammals. Cell. 2006, 124: 901-14. 10.1016/j.cell.2006.02.024.
- Pask A, Graves JA: Sex chromosomes and sex-determining genes: insights from marsupials and monotremes. Cell Mol Life Sci. 1999, 55: 864-75. 10.1007/s000180050340.
- Yan H, Talbert PB, Lee HR, Jett J, Henikoff S, Chen F, Jiang J: Intergenic locations of rice centromeric chromatin. PLoS Biol. 2008, 6: e286. 10.1371/journal.pbio.0060286.
- Carone DM, Longo MS, Ferreri GC, Hall L, Harris M, Shook N, Bulazel KV, Carone BR, Obergfell C, O'Neill MJ, O'Neill RJ: A new class of retroviral and satellite encoded small RNAs emanates from mammalian centromeres. Chromosoma. 2009, 118: 113-25. 10.1007/s00412-008-0181-5.
- Renfree MB, Hore TA, Shaw G, Graves JA, Pask AJ: Evolution of genomic imprinting: insights from marsupials and monotremes. Annu Rev Genomics Hum Genet. 2009, 10: 241-62. 10.1146/annurev-genom-082908-150026.
- Renfree MB, Papenfuss AT, Shaw G, Pask AJ: Eggs, embryos and the evolution of imprinting: insights from the platypus genome. Reprod Fertil Dev. 2009, 21: 935-42. 10.1071/RD09092.
- Wang C, Deakin JE, Zenger KR, Belov K, Graves JAM, Nicholas FW: An integrated tammar wallaby map and its use in creating a virtual tammar wallaby genome map. BMC Genomics.
- Breen M, Hitte C, Lorentzen T, Thomas R, Cadieu E, Sabacan L, Scott A, Evanno G, Parker H, Kirkness E, et al: An integrated 4249 marker FISH/RH map of the canine genome. BMC Genomics. 2004, 5: 65. 10.1186/1471-2164-5-65.
- Kent WJ: BLAT – the BLAST-like alignment tool. Genome Res. 2002, 12 (4): 656-664.
- Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, Dagdigian C, Fuellen G, Gilbert JG, Korf I, Lapp H, et al: The Bioperl toolkit: Perl modules for the life sciences. Genome Res. 2002, 12 (10): 1611-1618. 10.1101/gr.361602.
- Pop M, Kosack DS, Salzberg SL: Hierarchical scaffolding with Bambus. Genome Res. 2004, 14 (1): 149-159. 10.1101/gr.1536204.
- Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009, 10: R25. 10.1186/gb-2009-10-3-r25.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.