This poster presents the cucumber genome assembly generated from 454 Titanium paired-end (PE) data with the Celera Assembler software.
The poster was presented at the Sequencing, Finishing and Analysis in the Future (SFAF) conference in Santa Fe, New Mexico, on 26-29 May 2009.
Shotgun Assembly of a Repetitive Plant Genome
- Jason Miller, Sergey Koren, Brian Walenz, Granger Sutton : The J. Craig Venter Institute.
- James Knight, Thomas Jarvie, Chinnappa Kodira, Jason Affourtit, Tim Harkins : 454 Life Sciences.
- Yiqun Weng : USDA-ARS; University of Wisconsin, Madison.
Genomic repeats prevent accurate reconstruction by the whole-genome shotgun assembly method. Repeats are particularly troublesome for next-generation sequencing (NGS) approaches: because NGS platforms generate reads shorter than Sanger reads, they offer less power to resolve repeats.
We report the NGS sequencing and shotgun assembly of a repetitive plant genome. Cucumis sativus Gy14 is an inbred line of cucumber. Its genome had been predicted to be diploid, with 7 chromosomes spanning ~367 Mbp. Approximately half the genome was thought to be composed of heterochromatic repeats.
The genome was sequenced at 454 Life Sciences on the GS FLX Titanium platform with the XLR70 kit. Sequencing yielded 24.76M unpaired reads plus 11.44M paired-end reads from 3 Kbp and 20 Kbp libraries. Concordant assemblies were generated independently by two software applications, Newbler and Celera Assembler. Sequence alignment put 94% of both assemblies in one-to-one ungapped alignments and 99% in gapped alignments. The assemblies each contain about 200 Mbp of consensus scaffold sequence. Of 427 available cucumber mRNA sequences, 98% mapped to the Newbler assembly at a 90% identity threshold.
Both assemblies contain partially assembled sequence unassigned to scaffolds. Analysis of K-mer content revealed the unassigned sequence is repetitive and dissimilar from the scaffold population. A majority of the unassigned sequence maps to a small span of the scaffolds. We conclude that some scaffolds enter long tracts of genomic repeat that are otherwise left unassembled. Despite the repetitive nature of this genome, our methods segregated the heterochromatin and resolved the euchromatic sequence.
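The abstract does not describe how the K-mer analysis was performed. As a rough illustration only, the Python sketch below shows one generic way to compare the K-mer content of two sequence sets: the share of K-mer occurrences that belong to repeated K-mers, and the Jaccard similarity of the two K-mer sets. The function names, the choice of metrics, and the toy sequences are all assumptions made for this example, not the authors' method or data. For brevity it also ignores reverse complements.

```python
from collections import Counter


def kmer_counts(sequences, k):
    """Count every k-mer made only of A, C, G, T in the given sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip k-mers containing ambiguous bases such as N.
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts


def repeated_fraction(counts):
    """Fraction of k-mer occurrences belonging to k-mers seen more than once."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / total


def jaccard(counts_a, counts_b):
    """Jaccard similarity between the two sets of distinct k-mers."""
    keys_a, keys_b = set(counts_a), set(counts_b)
    union = keys_a | keys_b
    if not union:
        return 0.0
    return len(keys_a & keys_b) / len(union)


# Hypothetical toy sequences, not taken from the cucumber assemblies.
scaffold_like = ["ACGTTGCA"]
unassigned_like = ["ATATATAT"]

scaffold_kmers = kmer_counts(scaffold_like, 3)
unassigned_kmers = kmer_counts(unassigned_like, 3)
```

On real data, a high repeated fraction in one set combined with a low Jaccard similarity between the sets would be consistent with the pattern described above, though a production analysis would more likely rely on a dedicated K-mer counting tool.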
                 Max                N10                N25              N50             N95                Total
              length   count     length   count     length   count   length   count  length   count       length
=========  =========  ======  =========  ======  =========  ======  =======  ======  ======  ======  ===========
Scaffolds  4,649,969       6  2,717,851      22  1,506,488      68  815,116     363  73,997   3,610  202,566,885
Contigs      522,294      70    219,018     242    146,370     690   87,421   3,121  10,379   7,901  200,010,660
Cucumber assembly statistics. Size statistics, in base pairs, for the Celera Assembler result. The input data were 454 FLX Titanium whole-genome shotgun reads from cucumber genomic DNA. Each N statistic gives the number of largest contigs or scaffolds that together reach the stated percentage of the assembled bases, along with the length of the smallest among them. For instance, the N10 scaffold statistics show that the 6 largest scaffolds account for 10% of the assembled bases, and all of those scaffolds are 2.7 Mbp or longer.
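The N statistics described in the caption can be computed from a list of contig or scaffold lengths. The Python sketch below is a minimal illustration, not the code used by Celera Assembler. The function name `n_statistic` and the toy lengths are hypothetical, and it assumes the threshold is met when the running sum reaches or exceeds the target percentage (some tools use a strict inequality instead).

```python
def n_statistic(lengths, percent):
    """Return (count, length) for the N<percent> statistic.

    count  = number of largest sequences needed to reach `percent`
             of the total bases
    length = size of the smallest sequence among those
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range (0, 100]")

    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * percent:
            return count, length


# Hypothetical toy lengths (total 300 bp), not values from the table.
toy_lengths = [80, 70, 50, 40, 30, 20, 10]

for pct in (10, 50, 95):
    count, length = n_statistic(toy_lengths, pct)
    print(f"N{pct}: count={count}, length={length}")
```

With the toy lengths, the two largest sequences (80 and 70) sum to 150, exactly half of 300, so the N50 count is 2 and the N50 length is 70.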