This poster presents the bonobo assembly generated from 454 FLX and Titanium fragment and paired-end data with the Celera Assembler software.
The poster was presented at the 18th Annual International Conference on Intelligent Systems for Molecular Biology (ISMB) in Boston, MA, on July 11-15, 2010.
Bonobo genome de novo assembly generated by CABOG
- Jason Miller, Brian Walenz, Sergey Koren, Granger Sutton: The J. Craig Venter Institute.
The Bonobo Sequencing Consortium sequenced the genome of the great ape Pan paniscus. Using sequencers provided by 454 Life Sciences, the consortium produced over 250 million pyrosequencing reads.
Our group generated a de novo assembly of this large data set. The result contains 2695 scaffolds of length 2Kbp or more, spanning 2.858Gbp. The N50 values are 9.6Mbp for scaffolds and 67Kbp for contigs. The consortium has begun comparative genomics analysis of the primate lineage using this high-quality draft assembly.
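For readers unfamiliar with the N50 statistic quoted above, here is a minimal sketch of the usual definition: the largest length L such that sequences of length L or more hold at least half of the total bases. The function name `n50` and the toy lengths are illustrative and are not taken from the assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of the total bases.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50 requires at least one length")


# Toy example (not bonobo data): total is 32, and the two longest
# sequences (8 + 8) already reach half of it.
toy_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
toy_n50 = n50(toy_lengths)
```

Applied to scaffold lengths this gives the scaffold N50, and applied to contig lengths the contig N50.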
The assembly pipeline used Celera Assembler’s pyrosequencing variant, known as CABOG [1]. For its initial contig construction, the pipeline used the Best Overlap Graph (BOG) algorithm. BOG uses a simple heuristic to build a reduced graph of reads and overlaps. It applies other heuristics to detect repeat-induced path intersections in the graph, thus avoiding mis-assembly. Under certain assumptions, the BOG algorithm is equivalent to one of its predecessors, the transitive edge reduction (TER) algorithm. TER is exact but more costly to run. Analysis of the bonobo assembly reveals that BOG resolved more graph tangles than is possible with TER. Comparison to the human genome offers confirmation of the BOG output. BOG reduced the computational cost of the bonobo assembly, in both CPU time and RAM, by reducing graph complexity, while generating larger initial contigs and spanning more sub-read-length repeats.
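The graph reduction described above can be illustrated with a small sketch. It assumes that the "best" overlap at a read end is simply the longest one, and it uses a hypothetical `Overlap` record; it is not the Celera Assembler implementation and omits the repeat-detection heuristics entirely.

```python
from collections import namedtuple

# Hypothetical record: end `a_end` of read `a` overlaps end `b_end` of
# read `b` by `length` bases. Ends are labeled "5" or "3".
Overlap = namedtuple("Overlap", "a a_end b b_end length")


def best_overlaps(overlaps):
    """For each (read, end), keep only the longest overlap at that end.

    Assumption: "best" means longest. Ties keep the first overlap seen.
    """
    best = {}
    for ov in overlaps:
        # Every overlap is considered from the point of view of both reads.
        views = (
            ((ov.a, ov.a_end), (ov.b, ov.b_end)),
            ((ov.b, ov.b_end), (ov.a, ov.a_end)),
        )
        for key, other in views:
            current = best.get(key)
            if current is None or ov.length > current[1]:
                best[key] = (other, ov.length)
    return best


def mutual_best_edges(best):
    """Return edges where two read ends are each other's best overlap."""
    edges = set()
    for key, (other, _length) in best.items():
        partner = best.get(other)
        if partner is not None and partner[0] == key:
            edges.add(frozenset((key, other)))
    return edges


# Toy example: reads A, B, C laid out left to right. The A-C overlap is
# shorter than A-B and B-C, so it drops out of the reduced graph.
toy = [
    Overlap("A", "3", "B", "5", 100),
    Overlap("A", "3", "C", "5", 60),
    Overlap("B", "3", "C", "5", 90),
]
toy_best = best_overlaps(toy)
toy_edges = mutual_best_edges(toy_best)
```

Because each read end keeps at most one edge, the reduced graph grows linearly with the number of reads, which is in keeping with the lower CPU and RAM cost noted above.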
Value of the Algorithm
| Large unitigs hit by overlap | 36,526 | 56,788 | 56,900 |
| Number of intersections... |
| with unitigs of 3+ reads | 43,965 | 103,782 |
| with unitigs of 2+ reads | 53,398 | 149,750 |
| with unitigs of 1+ reads | 53,911 | 160,704 |
Intersections are the decision points for the algorithm, which is designed to choose a single path through some of them. The number of intersections spanned indicates the value of the algorithm for this data set. Each intersection is a best overlap between a read in a unitig and a read not in the unitig. “Inward overlaps” originate from reads outside the unitig. “Outward overlaps” originate from within the unitig. Output unitigs may span zero or more of both types of overlap.
Table values were inferred from the Celera Assembler output. To get at the substantive decision points, we considered only intersections with internal reads (those at least 1Kbp from either end) of large unitigs. Large was defined as 10Kbp or more. There were 81,588 large unitigs in the bonobo assembly.
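The filtering rule above can be sketched as follows. The `Intersection` record and its field names are hypothetical, and the sketch assumes that "at least 1Kbp from either end" means the whole read lies inside the 1Kbp margins; the actual counts were derived from Celera Assembler output, whose format is not reproduced here.

```python
from collections import Counter, namedtuple

# Hypothetical record for one intersection: the unitig's length, the
# placement of the intersected read within it, and whether the best
# overlap is "inward" (from a read outside) or "outward" (from within).
Intersection = namedtuple(
    "Intersection", "unitig_len read_begin read_end direction"
)

LARGE_UNITIG_BP = 10_000  # "large" threshold stated in the text
END_MARGIN_BP = 1_000     # internal reads lie at least this far from each end


def is_substantive(ix, min_unitig=LARGE_UNITIG_BP, margin=END_MARGIN_BP):
    """True if the intersection hits an internal read of a large unitig."""
    if ix.unitig_len < min_unitig:
        return False
    return ix.read_begin >= margin and ix.read_end <= ix.unitig_len - margin


def count_by_direction(intersections):
    """Count substantive intersections separately for each direction."""
    return Counter(
        ix.direction for ix in intersections if is_substantive(ix)
    )


# Toy records (not bonobo data).
toy_intersections = [
    Intersection(20_000, 5_000, 5_400, "inward"),    # kept
    Intersection(20_000, 200, 600, "outward"),       # too close to the start
    Intersection(20_000, 19_500, 19_900, "inward"),  # too close to the end
    Intersection(8_000, 3_000, 3_400, "outward"),    # unitig is not large
    Intersection(12_000, 1_000, 1_400, "outward"),   # kept, on the margin
]
toy_counts = count_by_direction(toy_intersections)
```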
Conclusion: The algorithm spanned about 3 intersections for every large unitig that it formed. Since the predecessor algorithms would have split unitigs at most of these intersections, and splitting a unitig at 3 points yields 4 pieces, the new algorithm may have quadrupled unitig size.
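The arithmetic behind this conclusion can be checked with a short sketch. The column labels of the table did not survive, so the sketch assumes that the two values in the "1+ reads" row count two disjoint kinds of intersection and may be summed; that reading is an assumption, not something the table states.

```python
# Values copied from the table and text above.
intersections_col1 = 53_911   # "with unitigs of 1+ reads", first value
intersections_col2 = 160_704  # "with unitigs of 1+ reads", second value
large_unitigs = 81_588        # large unitigs in the bonobo assembly

# Assumption: the two columns count disjoint intersection types.
total_intersections = intersections_col1 + intersections_col2
per_unitig = total_intersections / large_unitigs  # roughly 2.6

# Cutting a unitig at k internal points leaves k + 1 pieces, so spanning
# about 3 intersections corresponds to roughly a 4-fold gain in size.
spanned = round(per_unitig)
pieces_if_split = spanned + 1
```

Under this reading the ratio rounds to 3 intersections per large unitig, which matches the "about 3" figure and the quadrupling estimate.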
Validation by Alignment to Human
| Alignment to human | Large Unitigs | ATAC Alignments | Fraction |
The validation used human, a close relative of bonobo (common ancestor ~6.5 mya) with a high-quality genome assembly (hg19). Bonobo unitigs of length 10Kbp or greater (average length = 19089bp) were aligned to human. Alignment used the ATAC software [3]. ATAC seeds alignments with maximal unique, one-to-one, indel-free matches. ATAC builds chains of one or more seeds. ATAC chains are mutually coherent: seeds are same-sequence, same-strand, and monotonically ordered. ATAC reports maximal non-overlapping chains. ATAC does not align 100% repetitive sequences.
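The coherence condition on chains (same sequence, same strand, monotonic order) can be expressed as a small check. The `Seed` record and `is_coherent_chain` function are hypothetical illustrations, not part of ATAC, and the sketch assumes that reverse-strand seeds decrease in reference position as query position increases.

```python
from typing import NamedTuple


class Seed(NamedTuple):
    """Hypothetical seed match between one query unitig and a reference."""
    ref_name: str     # reference sequence, e.g. a chromosome name
    strand: str       # "+" or "-"
    query_start: int  # start position on the query
    ref_start: int    # start position on the reference
    length: int       # indel-free match length


def is_coherent_chain(seeds):
    """True if all seeds share a reference and strand and are monotonic."""
    if not seeds:
        return False
    first = seeds[0]
    for seed in seeds:
        if seed.ref_name != first.ref_name or seed.strand != first.strand:
            return False
    ordered = sorted(seeds, key=lambda s: s.query_start)
    ref_starts = [s.ref_start for s in ordered]
    pairs = list(zip(ref_starts, ref_starts[1:]))
    if first.strand == "+":
        return all(a < b for a, b in pairs)
    # Assumption: on the reverse strand, reference positions decrease.
    return all(a > b for a, b in pairs)


# Toy chains (coordinates are made up).
forward = [Seed("chr1", "+", 0, 1_000, 50), Seed("chr1", "+", 200, 1_210, 80)]
reverse = [Seed("chr2", "-", 0, 9_000, 40), Seed("chr2", "-", 300, 8_650, 60)]
two_chroms = [Seed("chr1", "+", 0, 1_000, 50), Seed("chr5", "+", 200, 1_210, 80)]
shuffled = [Seed("chr1", "+", 0, 5_000, 50), Seed("chr1", "+", 200, 1_000, 80)]
```

A unitig whose seeds fail this check, like `two_chroms` here, would need more than one chain, which corresponds to the segmented alignments discussed in the result.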
Result: Over 98% of the bonobo large unitig sequence aligned to human. Of the 19 unitigs with segmented alignments, 6 are probable mis-assemblies in that they align to 2 chromosomes or 2 distant loci. The other 13 segmented alignments cover tandem loci in human with one segment inverted. These could reveal true evolutionary inversions. (None of the 19 have intersecting overlaps within 1Kbp of the internal alignment breakpoints.) Investigation of the 429 unitigs with no ATAC alignment indicates they are valid assemblies of genomic repeats. Indeed, other aligners mapped them to multiple human loci.
Conclusion: Nearly all large bonobo unitigs are validated by alignment to human. The heuristic BOG algorithm appears aggressive but accurate even on complex genome data. By capturing half of the genome in large unitigs, it proved valuable to the assembly of the bonobo shotgun sequencing data.
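The "half of the genome" figure can be approximated from numbers given earlier in the text. This is a rough consistency check only: it multiplies the count of large unitigs by their reported average length and divides by the scaffold span, and it assumes those three figures are directly comparable.

```python
# Figures quoted in the text above.
large_unitigs = 81_588            # unitigs of 10Kbp or more
average_unitig_bp = 19_089        # average length of those unitigs
scaffold_span_bp = 2_858_000_000  # 2.858Gbp spanned by scaffolds

# Approximate bases held in large unitigs, and their share of the span.
large_unitig_bp = large_unitigs * average_unitig_bp
fraction_in_large_unitigs = large_unitig_bp / scaffold_span_bp
```

The product is roughly 1.56Gbp, a little over half of the 2.858Gbp span.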