Cucumber Poster


This poster presents the Cucumber assembly generated from 454 Titanium PE data with Celera Assembler software.


The poster was presented at the Sequencing, Finishing and Analysis in the Future (SFAF) conference in Santa Fe, New Mexico, on 26-29 May 2009.


Shotgun Assembly of a Repetitive Plant Genome


  • Jason Miller, Sergey Koren, Brian Walenz, Granger Sutton : The J. Craig Venter Institute.
  • James Knight, Thomas Jarvie, Chinnappa Kodira, Jason Affourtit, Tim Harkins : 454 Life Sciences.
  • Yiqun Weng : USDA-ARS; University of Wisconsin, Madison.


Genomic repeats prevent accurate reconstruction by the whole-genome shotgun assembly method. Repeats are particularly troublesome for next-generation sequencing (NGS) approaches: because NGS platforms generate reads shorter than Sanger reads, they offer less power to resolve repeats.

We report the next-generation sequencing and shotgun assembly of a repetitive plant genome. Cucumis sativus Gy14 is an inbred line of cucumber. Its genome had been predicted to be diploid, with 7 chromosomes spanning ~367 Mbp. Approximately half the genome was thought to be composed of heterochromatic repeats.

The genome was sequenced at 454 Life Sciences on the GS FLX Titanium platform with the XLR70 kit. Sequencing yielded 24.76M unpaired reads plus 11.44M paired-end reads from 3 Kbp and 20 Kbp libraries. Concordant assemblies were generated independently by two software applications, Newbler and Celera Assembler. Sequence alignment put 94% of both assemblies in one-to-one ungapped alignments and 99% in gapped alignments. The assemblies each contain about 200 Mbp of consensus scaffold sequence. Of 427 available cucumber mRNA sequences, 98% mapped to the Newbler assembly at a 90% identity threshold.

Both assemblies contain partially assembled sequence unassigned to scaffolds. Analysis of K-mer content revealed the unassigned sequence is repetitive and dissimilar from the scaffold population. A majority of the unassigned sequence maps to a small span of the scaffolds. We conclude that some scaffolds enter long tracts of genomic repeat that are otherwise left unassembled. Despite the repetitive nature of this genome, our methods segregated the heterochromatin and resolved the euchromatic sequence.
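The poster does not describe how the K-mer analysis was carried out. As a rough illustration only, the sketch below shows one simple way K-mer content could be used to ask the two questions raised above: how repetitive a sequence is, and how much of its K-mer content it shares with another sequence set. The function names, the choice of K, and the toy sequences are all hypothetical and are not taken from the authors' pipeline.

```python
from collections import Counter


def kmer_counts(sequence, k):
    """Count every overlapping k-mer in a sequence (case-insensitive)."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def repeat_fraction(sequence, k, min_count=2):
    """Fraction of k-mer positions whose k-mer occurs at least min_count times.

    Values near 1.0 suggest highly repetitive sequence (illustrative measure,
    not the statistic used on the poster).
    """
    counts = kmer_counts(sequence, k)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    repeated = sum(c for c in counts.values() if c >= min_count)
    return repeated / total


def shared_kmer_fraction(query, reference, k):
    """Fraction of distinct k-mers in `query` that also occur in `reference`."""
    query_kmers = set(kmer_counts(query, k))
    reference_kmers = set(kmer_counts(reference, k))
    if not query_kmers:
        return 0.0
    return len(query_kmers & reference_kmers) / len(query_kmers)


if __name__ == "__main__":
    # Toy sequences, purely for demonstration.
    print(repeat_fraction("ACACACAC", 2))
    print(shared_kmer_fraction("ACGT", "ACGA", 2))
```

Under this kind of measure, sequence that is "repetitive and dissimilar from the scaffold population" would show a high repeat fraction and a low shared fraction against the scaffolds. Real genomes would need a much larger K and a memory-efficient counter.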


                 Max  N10               N25               N50             N95            Total
              length  count     length  count     length  count   length  count  length  count       length
           =========  =====  =========  =====  =========  =====  =======  =====  ======  =====  ===========
Scaffolds  4,649,969      6  2,717,851     22  1,506,488     68  815,116    363  73,997  3,610  202,566,885
Contigs      522,294     70    219,018    242    146,370    690   87,421  3,121  10,379  7,901  200,010,660

Cucumber assembly statistics. Size statistics in base pairs for the Celera Assembler result. The input data were 454 FLX Titanium whole-genome shotgun reads from cucumber genomic DNA. Each N statistic gives the number of largest contigs or scaffolds needed to reach that percentage of the assembled bases, together with the length of the smallest sequence in that set. For instance, the N10 scaffold statistics show that the 6 largest scaffolds account for 10% of the assembled bases, and all of those scaffolds are 2.7 Mbp or longer.
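The N statistics described in the caption can be reproduced from a list of sequence lengths. The following is a minimal sketch under that definition: sort the lengths from largest to smallest and accumulate them until the requested percentage of the total bases is reached. The function name `n_stat` and the example lengths are hypothetical; this is not code from Celera Assembler.

```python
def n_stat(lengths, percent):
    """Return (count, length) for an N statistic such as N10 or N50.

    count  : how many of the largest sequences are needed to cover at least
             `percent` percent of the total bases
    length : length of the smallest sequence in that set
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range (0, 100]")

    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return count, length
    # Unreachable for positive lengths, kept as a safeguard.
    return len(ordered), ordered[-1]


if __name__ == "__main__":
    # Toy lengths in base pairs, purely for demonstration.
    toy_lengths = [100, 80, 60, 40, 20]
    for pct in (10, 50, 95):
        print(pct, n_stat(toy_lengths, pct))
```

Applied to the full list of scaffold lengths, `n_stat(lengths, 10)` would correspond to the count and length reported in the N10 columns of the table.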

