Version 8.0 Release Notes

These are release notes for Celera Assembler version 8.0, which was released on November 5th, 2013.

This distribution package provides a stable, tested, documented version of the software. The distribution is usable on most Unix-like platforms, and some platforms have pre-compiled binary distributions ready for installation.

The source code package includes full source code (revision r4466), Makefiles, and scripts. A subset of the kmer package (version r1969), used by some modules of Celera Assembler, is included.

This package was prepared by scientists at the J. Craig Venter Institute with funding provided by the National Institutes of Health.

Full documentation is available online.


Please cite Celera Assembler in publications that refer to its algorithm or its output. The standard citation is the original paper [Myers et al. (2000) A Whole-Genome Assembly of Drosophila. Science 287 2196-2204]. More recent papers describe modifications for human genome assembly [Istrail et al. 2004; Levy et al. 2007], metagenomics assembly [Venter et al. 2004; Rusch et al. 2007], haplotype separation [Levy et al. 2007; Denisov et al. 2008], a Sanger+pyrosequencing hybrid pipeline [Goldberg et al. 2006] and native assembly of 454 data [Miller et al. 2008]. There are links to these papers, and more, in the on-line documentation.

Compilation and Installation

Users can download Celera Assembler as source code or as pre-compiled binaries. The source code package needs to be compiled and installed before it can be used. The binary distributions need only be unpacked, but they are not available for all platforms.

To use the source code, execute these commands on any Unix-like platform:

% bzip2 -dc wgs-8.0.tar.bz2 | tar -xf -
% cd wgs-8.0
% cd kmer
% gmake install
% cd ../src
% gmake
% cd ..

To use the binary distributions, choose a platform, download that package, then unpack it with a command like this:

% bzip2 -dc wgs-8.0-*.tar.bz2 | tar -xf -

In both cases, you can run the assembler with:

% wgs-8.0/*/bin/runCA

Changes in CA 8.0

The major changes between 7.0 and 8.0 are listed on this page under New Features and Bug Fixes. See the full list of changes, which also includes bugs that were introduced and then fixed during the upgrade from 7.0 to 8.0.

Backward Compatibility

Celera Assembler 8.0 uses new file formats. Its intermediate files are generally incompatible with earlier versions of CA. Users should not run 8.0 software against earlier pipeline files, or earlier software on 8.0 pipeline files. Users should launch CA 8.0 assemblies from scratch.

New Features

  1. To support PacBio and Moleculo reads, the default maximum read length was increased from 2,047 bp (AS_READ_MAX_LEN_BITS=11) to 65,535 bp (AS_READ_MAX_LEN_BITS=16). If you are assembling only short reads, you can decrease the maximum read length to reduce the size of the overlapper output and the resulting ovlStore.
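
The two limits quoted above follow from the number of bits used to store a read length: with b bits, the largest representable length is 2^b - 1. The following is a minimal sketch of that arithmetic only. The function name max_read_len is hypothetical and is not part of Celera Assembler, and the sketch does not show how AS_READ_MAX_LEN_BITS itself is changed.

```shell
# Hypothetical helper (not part of Celera Assembler): print the largest
# read length that fits in the given number of bits, i.e. 2^bits - 1.
max_read_len() {
  echo $(( (1 << $1) - 1 ))
}

max_read_len 11   # previous default: prints 2047
max_read_len 16   # CA 8.0 default:   prints 65535
```

Calling the helper with other bit counts may help when choosing a smaller limit for a short-read-only assembly.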


(list not yet compiled)

Bug Fixes

(list not yet compiled)

Known Problems

  • Scaffolder (CGW) is slow
  1. Some data sets exhibit very long run times in the scaffolding module.
  • Invalid Results
  1. In sffToCA, paired-end read lengths are calculated incorrectly in rare cases when '-trim hard' is used. Until the issue is resolved, users should always use '-trim chop' on 454 paired-end libraries.
  • Scaling
  1. Creating an overlap store for billions of fragments is a significant bottleneck. This process can take multiple days, sometimes longer than the (parallel) computation of the overlaps.
  • Algorithmic limitations
  1. There is no explicit support for high coverage. Coverage as high as 80X induces combinations of sequencing errors that confound Celera Assembler. At best this leads to higher reported rates of allelic variation. At worst this leads to a fractured assembly. Sampling from high-coverage reads can yield a better assembly.
  2. There is little support for assembly of data sets with a small ratio of mate pairs to unpaired fragments. Mate pairs are used to detect misassemblies and to form contigs and scaffolds. Too few mate pairs can result in shattered assemblies and many bases in degenerate contigs.
  3. There is no support for bar-coded 454 data. Users with bar-coded data may use some other utility to remove the bar code sequence and partition reads by bar code into separate SFF files.
  4. There is no support for data from ABI SOLiD.
  5. There is no support for cDNA, exon-enriched DNA, or DNA amplified with bias of any sort.
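
Item 1 under Algorithmic limitations suggests sampling from high-coverage reads. One possible way to do that before the reads are converted for the assembler is sketched below. This is only an illustration under stated assumptions: the reads are unpaired, they are stored as FASTQ with exactly four lines per record, and the function name subsample_fastq is hypothetical and not part of Celera Assembler. Mated reads would need both mates kept or dropped together, which this sketch does not handle.

```shell
# Hypothetical helper (not part of Celera Assembler): keep every Nth read
# from an unpaired FASTQ stream that uses exactly four lines per record.
# Usage: subsample_fastq N < in.fastq > out.fastq
subsample_fastq() {
  awk -v n="$1" '
    # Record index is 0 for lines 1-4, 1 for lines 5-8, and so on.
    { rec = int((NR - 1) / 4) }
    # Print all four lines of each record whose index is a multiple of n.
    rec % n == 0 { print }
  '
}
```

For example, running the helper with N=2 on an 80X data set keeps every second read, for roughly 40X. Deterministic every-Nth sampling is used here instead of random sampling so that the result is reproducible; it assumes the read order in the file is not correlated with genome position.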


Copyright 1999-2004 by Applera Corporation. Copyright 2005-2013 by the J. Craig Venter Institute. The Celera Assembler software, also known as the wgs-assembler and CABOG, is open-source and available free of charge subject to the GNU General Public License, version 2.