Version 8.2 Release Notes

These are release notes for Celera Assembler version 8.2, which was released on November 12th, 2014.

This distribution package provides a stable, tested, documented version of the software. The distribution is usable on most Unix-like platforms, and some platforms have pre-compiled binary distributions ready for installation.

The source code package includes full source code (revision r4583), Makefiles, and scripts. A subset of the kmer package (version r1993), used by some modules of Celera Assembler, is included. This distribution also includes SAMtools, Jellyfish 2.0, PBUTGCNS, and parts of the Falcon assembler.

Full documentation can be found online at


Please cite Celera Assembler in publications that refer to its algorithm or its output. The standard citation is the original paper [Myers et al. (2000) A Whole-Genome Assembly of Drosophila. Science 287 2196-2204]. More recent papers describe modifications for human genome assembly [Istrail et al. 2004; Levy et al. 2007], metagenomics assembly [Venter et al. 2004; Rusch et al. 2007], haplotype separation [Levy et al. 2007; Denisov et al. 2008], a Sanger+pyrosequencing hybrid pipeline [Goldberg et al. 2006], native assembly of 454 data [Miller et al. 2008], and PacBio data [Koren et al. 2013]. Links to these papers, and more, are in the online documentation.

Compilation and Installation

Users can download Celera Assembler as source code or as pre-compiled binaries. The source code package needs to be compiled and installed before it can be used. The binary distributions need only be unpacked, but they are not available for all platforms.

To use the source code, execute these commands on any Unix-like platform:

bzip2 -dc wgs-8.2.tar.bz2 | tar -xf -
cd wgs-8.2
cd kmer && make install && cd ..
cd samtools && make && cd ..
cd src && make && cd ..
cd ..

To use the binary distributions, choose a platform, download that package, then unpack it with a Unix command such as:

bzip2 -dc wgs-8.2-*.tar.bz2 | tar -xf -
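
A partially downloaded archive can fail halfway through extraction and leave an incomplete tree. The sketch below is one way to guard against that, not part of the distribution: it checks the archive with `bzip2 -t` before unpacking. The function name `unpack_checked` and the optional destination directory are illustrative assumptions.

```shell
# unpack_checked ARCHIVE [DEST_DIR]
# Hypothetical helper (not shipped with Celera Assembler).
# Tests the bzip2 stream first, then unpacks into DEST_DIR
# (defaults to the current directory).
unpack_checked() {
    archive=$1
    dest=${2:-.}

    # bzip2 -t decompresses in memory and reports corruption
    # without writing any output.
    if ! bzip2 -t "$archive"; then
        echo "corrupt or incomplete archive: $archive" >&2
        return 1
    fi

    # Same decompress-and-untar pipeline as above, with -C to
    # select the destination directory.
    mkdir -p "$dest" && bzip2 -dc "$archive" | tar -xf - -C "$dest"
}

# Example (the file name is a placeholder for the package you downloaded):
#   unpack_checked wgs-8.2-PLATFORM.tar.bz2 "$HOME/software"
```

The integrity test reads the archive twice, which is a small cost compared with cleaning up a half-extracted directory.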

In both cases, you can run the assembler with:


Changes in CA 8.2

Celera Assembler 8.2 includes major improvements to the PBcR pipeline:

  • Support for a rapid probabilistic overlap module for long high-error reads. PBcR using only PacBio data should now require roughly 30 minutes end-to-end (correction and assembly) for most prokaryotic genomes on a typical desktop machine. Hybrid correction with Illumina or other high-accuracy data remains unchanged from CA 8.1.
  • Support for additional consensus modules when assembling reads longer than 500 bp.

Backward Compatibility

Celera Assembler 8.2 is compatible with Celera Assembler 8.1. Its intermediate files are generally incompatible with earlier versions of CA. Users should not run 8.2 software against earlier pipeline files, or earlier software on 8.2 pipeline files. Users should launch CA 8.2 assemblies from scratch.

Bug Fixes

  • Fix bug when correcting large datasets that caused the 1-overlap step to fail.
  • Fix bug in the consensus module for unitigs shorter than 500 bp.
  • Fix bug that made 1-overlapper crash when an outdated Java is installed. The Java version error is now reported to the user instead.

Known Problems

  • Scaffolder (CGW) is slow
      1. Some data sets exhibit enormous run times in the scaffolding module.
  • Invalid Results
      1. In sffToCA, paired-end read lengths are calculated incorrectly in rare cases when '-trim hard' is used. Until the issue is resolved, users should always use '-trim chop' on 454 paired-end libraries.
  • Scaling
      1. Creating an overlap store for billions of fragments is a significant bottleneck. This process can take multiple days, sometimes longer than the (parallel) computation of the overlaps.
  • Algorithmic limitations
      1. There is no explicit support for high coverage. Coverage such as 80X induces combinations of sequencing errors that confound Celera Assembler. At best this leads to higher reported rates of allelic variation. At worst this leads to a fractured assembly. Sampling from high-coverage reads can yield a better assembly.
      2. There is little support for assembly of data sets with a small ratio of mate pairs to unpaired fragments. Mate pairs are used to detect misassemblies and to form contigs and scaffolds. Too few mate pairs can result in shattered assemblies and many bases in degenerate contigs.
      3. There is no support for bar-coded 454 data. Users with bar-coded data may use some other utility to remove the bar code sequence and partition reads by bar code into separate SFF files.
      4. There is no support for data from ABI SOLiD.
      5. There is no support for cDNA, exon-enriched DNA, or DNA amplified with bias of any sort.
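
The high-coverage limitation above suggests sampling reads before assembly. As a rough illustration only, the sketch below computes the fraction of reads to keep and applies it to a FASTQ stream with awk. It is not a Celera Assembler tool. The function names, the FASTQ input, and the target coverage are all assumptions, and the code expects strict four-line FASTQ records with no wrapped sequence lines.

```shell
# keep_fraction CURRENT_COVERAGE TARGET_COVERAGE
# Hypothetical helper: prints the fraction of reads to keep,
# capped at 1 when coverage is already at or below the target.
keep_fraction() {
    awk -v cur="$1" -v tgt="$2" 'BEGIN {
        f = (cur > tgt) ? tgt / cur : 1
        printf "%.3f\n", f
    }'
}

# subsample_fastq FRACTION SEED < in.fastq > out.fastq
# Hypothetical helper: keeps each 4-line record with probability
# FRACTION. The decision is made on the header line and applied
# to the three lines that follow it.
subsample_fastq() {
    awk -v frac="$1" -v seed="$2" '
        BEGIN { srand(seed) }
        NR % 4 == 1 { keep = (rand() < frac) }
        keep
    '
}

# Example: reduce an estimated 80X data set to about 40X.
#   frac=$(keep_fraction 80 40)
#   subsample_fastq "$frac" 42 < reads.fastq > reads.sampled.fastq
```

For mate pairs stored in two files with the same record order, running the function on each file with the same seed should select the same records in both, which keeps mates together. Sampling the two files with different seeds would break the pairing.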


Copyright 1999-2004 by Applera Corporation. Copyright 2005-2013 by the J. Craig Venter Institute. The Celera Assembler software, also known as the wgs-assembler and CABOG, is open-source and available free of charge subject to the GNU General Public License, version 2.