Version 8.3 Release Notes
These are release notes for Celera Assembler version 8.3rc2, which was released on May 24th, 2015.
This distribution package provides a stable, tested, documented version of the software. The distribution is usable on most Unix-like platforms, and some platforms have pre-compiled binary distributions ready for installation.
The source code package includes full source code (revision 4649), Makefiles, and scripts. A subset of the kmer package (http://kmer.sourceforge.net/, version r2004), used by some modules of Celera Assembler, is included. This distribution includes Jellyfish 2.0, PBUTGCNS, PBDAGCON, BLASR, and parts of the Falcon assembler.
Full documentation can be found online at http://wgs-assembler.sourceforge.net/.
Please cite Celera Assembler in publications that refer to its algorithm or its output. The standard citation is the original paper [Myers et al. (2000) A Whole-Genome Assembly of Drosophila. Science 287 2196-2204]. More recent papers describe modifications for human genome assembly [Istrail et al. 2004; Levy et al. 2007], metagenomics assembly [Venter et al. 2004; Rusch et al. 2007], haplotype separation [Levy et al. 2007; Denisov et al. 2008], a Sanger+pyrosequencing hybrid pipeline [Goldberg et al. 2006], native assembly of 454 data [Miller et al. 2008], and PacBio data [Berlin et al. 2015]. There are links to these papers, and more, in the on-line documentation (http://wgs-assembler.sourceforge.net/).
Compilation and Installation
Users can download Celera Assembler as source code or as pre-compiled binaries. The source code package needs to be compiled and installed before it can be used. The binary distributions need only be unpacked, but they are not available for all platforms.
To use the source code, execute these commands on any Unix-like platform:
bzip2 -dc wgs-8.3rc2.tar.bz2 | tar -xf -
cd wgs-8.3rc2
cd kmer && ./configure.sh && make install && cd ..
cd src && make && cd ..
cd ..
To use the binary distributions, choose a platform, download that package, then unpack it with a command like this:
bzip2 -dc wgs-8.3rc2-*.tar.bz2 | tar -xf -
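Before changing into the unpacked directory, it can help to confirm which top-level directory an archive actually creates. The following is a minimal sketch using Python's standard `tarfile` module; the helper name `top_level_dirs` is hypothetical and is not part of Celera Assembler.

```python
import tarfile


def top_level_dirs(archive_path):
    """Return the sorted top-level names inside a bzip2-compressed tar archive."""
    names = set()
    # "r:bz2" opens a bzip2-compressed tar archive for reading.
    with tarfile.open(archive_path, "r:bz2") as archive:
        for member in archive.getmembers():
            # Drop empty and "." components so "./pkg/file" and "pkg/file" agree.
            parts = [p for p in member.name.split("/") if p not in ("", ".")]
            if parts:
                names.add(parts[0])
    return sorted(names)
```

Calling `top_level_dirs("wgs-8.3rc2.tar.bz2")` on the downloaded source package would list the directory name to use in the `cd` step.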
In both cases, you can run the assembler with:
Changes in CA 8.3
Celera Assembler 8.3 includes major improvements to the PBcR pipeline:
- Updated MHAP to version 1.5. The MHAP module is now 5-fold faster than before and requires fewer than 10K CPU hours to overlap 70X of human P6 data on an NFS filesystem.
- Significantly improved support for lower-coverage data using self-correction. Assembly of coverage as low as 20X is now possible, and 30X can produce high-quality assemblies. See the S. cerevisiae plot below for a comparison between old and new results.
- Increased maximum sequence size from 64Kbp to 256Kbp.
- Fixed support for PBS/Torque, in addition to LSF and SGE.
- Reduced memory usage for overlapping and overlap error correction during assembly.
- Added support for Oxford Nanopore assembly options.
- Hybrid PBcR with Illumina or other high-accuracy data remains unchanged from CA 8.1 and is no longer being updated.
MHAP 1.5 reduces the memory usage of the index and accelerates the second-stage filter by using random subsampling. It also eliminates repeat k-mer filtering and instead uses tf-idf weighting to down-weight repetitive k-mers. However, overlaps composed of only repetitive k-mers will still be found.
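To illustrate the idea of down-weighting rather than filtering, here is a minimal sketch of textbook tf-idf applied to k-mers, treating each read as a document. It is not MHAP's actual formula or code; the function names and the smoothed `log(1 + N/df)` form are assumptions chosen so that even a k-mer present in every read keeps a small positive weight.

```python
import math
from collections import Counter


def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def idf_weights(reads, k):
    """Smoothed inverse document frequency of each k-mer across all reads."""
    doc_freq = Counter()
    for read in reads:
        # Count each k-mer at most once per read (document frequency).
        doc_freq.update(set(kmers(read, k)))
    n = len(reads)
    # log(1 + n/df) is always positive: common k-mers are down-weighted, not dropped.
    return {kmer: math.log(1 + n / df) for kmer, df in doc_freq.items()}


def tfidf(read, k, idf):
    """tf-idf weight of each k-mer in a single read."""
    counts = Counter(kmers(read, k))
    total = sum(counts.values())
    return {kmer: (count / total) * idf[kmer] for kmer, count in counts.items()}
```

With weights like these, a k-mer shared by many reads contributes less to a similarity score than a rare one, yet a pair of reads sharing only common k-mers still receives a nonzero score.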
Below, we compare assembly results from CA 8.2, CA 8.3, and Falcon 0.2.1. Falcon relies on DALIGNER to find overlaps.
| Genome | Assembler | Overlap h | Total h | # Contigs | Total bp | Max contig | NG50 |
|---|---|---|---|---|---|---|---|
| S. cerevisiae | CA 8.2 | 20 | 28 | 21 | 12,255,862 | 1,547,184 | 818,260 |
| A. thaliana | CA 8.2 | 1,700 | 1,900 | 38 | 120,486,579 | 15,874,695 | 11,164,124 |
| D. melanogaster | CA 8.2 | 890 | 1,060 | 132 | 143,328,915 | 25,750,791 | 20,985,587 |
| H. sapiens CHM1 | CA 8.2 | 220,000 | 260,000 | 3,434 | 2,828,300,545 | 30,045,963 | 5,099,387* |
| H. sapiens CHM13 | CA 8.3 | 8,171 | 68,602 | 15,538 | 3,061,261,250 | 81,522,549 | 14,591,835** |
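For readers unfamiliar with the NG50 column, the sketch below shows the usual definition: sort contigs from longest to shortest and report the length of the contig at which the running total first reaches half of the genome size. The function name `ng50` is hypothetical, and the genome sizes used for the table are not stated in these notes, so no table values are reproduced here.

```python
def ng50(contig_lengths, genome_size):
    """Return the contig length at which the cumulative length, taken from
    longest to shortest, first reaches half of genome_size (0 if it never does)."""
    half = genome_size / 2
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= half:
            return length
    # The assembly covers less than half of the genome size.
    return 0
```

Unlike N50, which uses the total assembly length as the denominator, NG50 uses a fixed genome size, so values are comparable across assemblies of the same genome.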
MHAP overlapping is faster than DALIGNER for larger genomes; however, the overall runtime of CA 8.3 is longer because of the overhead of Celera Assembler data structures. In all cases, the assemblies produced by CA 8.3 are more contiguous than those from Falcon, especially in repeat regions. For example, on S. cerevisiae the CA 8.3 assembly has 10 contigs that include both telomeres, while the Falcon/DALIGNER assembly has only 1. CA 8.3 also outperforms CA 8.2 in both continuity and speed. As predicted by the MHAP publication, overlapping is faster for longer sequences: CHM13, sequenced with 27M reads using P6 chemistry, took 50% of the runtime of CHM1, sequenced with 22M reads using P5 chemistry, with the same parameters.
We assembled Falcon-corrected sequences with Celera Assembler to isolate the effects of the overlapper. Both CA 8.3 and Falcon use the same consensus module with identical parameters, and the assembly with Celera Assembler was performed with identical parameters. Thus, the difference between these assemblies is MHAP versus DALIGNER overlaps (not the consensus or assembly algorithm). In all cases, assembling with Celera Assembler improves the N50, but the MHAP assemblies are more contiguous than both the pure Falcon assemblies and the DALIGNER/CA 8.3 assemblies.
Celera Assembler 8.3 is not compatible with Celera Assembler 8.2. Its intermediate files are generally incompatible with earlier versions of CA. Users should not run 8.3 software against earlier pipeline files, or earlier software on 8.3 pipeline files. Users should launch CA 8.3 assemblies from scratch.
- Fix bug that would undertrim or delete reads when the first overlap for the read was high error.
- Fix truncation of long sequences. Increase the maximum line length to 16 million in gkpstore. Reads longer than the assembler maximum (256k) will be truncated to 256k.
- Fix bug where PBcR would always use /usr/bin/perl; it now relies on /usr/bin/env instead.
- Fix bug that made 1-overlapper fail when an outdated Java is available. The Java version error is now reported to the user instead.
- Fix bug that would cause array jobs to fail on PBS.
- Fix bug that would cause errors when submitting jobs to PBS.
- PBcR hybrid correction is slow
  - PBcR using Illumina or other data for correction is significantly slower than self-correction (PacBio data only, with at least 25X coverage). This is because of its reliance on BLASR or a built-in module for overlapping. The self-correction mode of PBcR is recommended.
- Scaffolder (CGW) is slow
  - Some data sets exhibit enormous run times in the scaffolding module.
- Invalid results
  - In sffToCA, paired-end read lengths are calculated incorrectly in rare cases when '-trim hard' is used. Until the issue is resolved, users should always use '-trim chop' on 454 paired-end libraries.
- Creating an overlap store for billions of fragments is a significant bottleneck. This process can take multiple days, sometimes longer than the (parallel) computation of the overlaps.
- Algorithmic limitations
  - There is no explicit support for high coverage. Coverage such as 80X induces combinations of sequencing errors that confound Celera Assembler. At best this leads to higher reported rates of allelic variation. At worst this leads to a fractured assembly. Sampling from high-coverage reads can yield a better assembly.
  - There is little support for assembly of data sets with a small ratio of mate pairs to unpaired fragments. Mate pairs are used to detect misassemblies and to form contigs and scaffolds. Too few mate pairs can result in shattered assemblies and many bases in degenerate contigs.
  - There is no support for bar-coded 454 data. Users with bar-coded data may use some other utility to remove the bar code sequence and partition reads by bar code into separate SFF files.
  - There is no support for data from ABI SOLiD.
  - There is no support for cDNA, exon-enriched DNA, or DNA amplified with bias of any sort.
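The high-coverage limitation listed above suggests sampling reads down to a lower coverage before assembly. The following is a minimal sketch of one way to do that, assuming the reads are already loaded as strings and an approximate genome size is known; the function name `downsample` and its parameters are hypothetical and are not a Celera Assembler feature.

```python
import random


def downsample(reads, genome_size, target_coverage, seed=0):
    """Randomly pick reads until their total length reaches
    target_coverage * genome_size, or until all reads are used."""
    target_bases = target_coverage * genome_size
    order = list(reads)
    # A fixed seed makes the sample reproducible between runs.
    random.Random(seed).shuffle(order)
    picked, total = [], 0
    for read in order:
        if total >= target_bases:
            break
        picked.append(read)
        total += len(read)
    return picked
```

Uniform random sampling keeps the read-length distribution roughly intact. Other strategies, such as keeping only the longest reads, trade that property for longer overlaps.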
Copyright 1999-2004 by Applera Corporation. Copyright 2005-2013 by the J. Craig Venter Institute. The Celera Assembler software, also known as the wgs-assembler and CABOG, is open-source and available free of charge subject to the GNU General Public License, version 2.