Version 8.3 Release Notes

These are release notes for Celera Assembler version 8.3rc2, which was released on May 24th, 2015.

This distribution package provides a stable, tested, documented version of the software. The distribution is usable on most Unix-like platforms, and some platforms have pre-compiled binary distributions ready for installation.

The source code package includes full source code (revision 4649), Makefiles, and scripts. A subset of the kmer package (http://kmer.sourceforge.net/, version r2004), used by some modules of Celera Assembler, is included. This distribution includes Jellyfish 2.0, PBUTGCNS, PBDAGCON, BLASR, and parts of the Falcon assembler.

Full documentation can be found online at http://wgs-assembler.sourceforge.net/.

Citation

Please cite Celera Assembler in publications that refer to its algorithm or its output. The standard citation is the original paper [Myers et al. (2000) A Whole-Genome Assembly of Drosophila. Science 287 2196-2204]. More recent papers describe modifications for human genome assembly [Istrail et al. 2004; Levy et al. 2007], metagenomics assembly [Venter et al. 2004; Rusch et al. 2007], haplotype separation [Levy et al. 2007; Denisov et al. 2008], a Sanger+pyrosequencing hybrid pipeline [Goldberg et al. 2006], native assembly of 454 data [Miller et al. 2008], and PacBio data [Berlin et al. 2015]. There are links to these papers, and more, in the on-line documentation (http://wgs-assembler.sourceforge.net/).

Compilation and Installation

Users can download Celera Assembler as source code or as pre-compiled binaries. The source code package needs to be compiled and installed before it can be used. The binary distributions need only be unpacked, but they are not available for all platforms.

To use the source code, execute these commands on any Unix-like platform:

bzip2 -dc wgs-8.3rc2.tar.bz2 | tar -xf -
cd wgs-8.3rc2
cd kmer && ./configure.sh && make install && cd ..
cd src && make && cd ..
cd ..

To use the binary distributions, choose a platform, download that package, then unpack it with a command like this:

bzip2 -dc wgs-8.3rc2-*.tar.bz2 | tar -xf -

In both cases, you can run the assembler with:

wgs-8.3rc2/*/bin/runCA

Changes in CA 8.3

Celera Assembler 8.3 includes major improvements to the PBcR pipeline:

  • Updated MHAP to version 1.5. The MHAP module is now 5-fold faster than before and requires fewer than 10K CPU hours to overlap 70X of human P6 data on an NFS filesystem.
  • Significantly improved support for lower coverage data using self-correction. Assembly of coverage as low as 20X is now possible, and 30X can produce high-quality assemblies. See the S. cerevisiae plot below for a comparison between old and new results.
Contig NG50 (A) and NG10 (B) lengths are given for PBcR assemblies of S. cerevisiae at various sequencing coverages (15–120X) using BLASR, MHAP sensitive in CA 8.2, and a beta version of CA 8.3. For this example, there is a benefit from increasing overlap sensitivity at lower coverage. The MHAP sensitive parameters outperform the BLASR assembly at lower coverage. CA 8.3 automatically switches to MHAP sensitive below 50X and fast above. Below 30X, the plot shows results using the recommended low coverage parameters, which must be enabled by the user and are not turned on by default. The dashed line is an Illumina MiSeq 2x300 @450bp assembly from Lee et al. (2014). The ECTools result is also from Lee et al. (2014), using the Illumina MiSeq assembly for PacBio correction.

  • Increased maximum sequence size from 64Kbp to 256Kbp.
  • Fixed support for PBS/Torque in addition to LSF and SGE.
  • Reduced memory usage for overlapping/overlap error correction during assembly.
  • Added support for Oxford Nanopore assembly options.
  • Hybrid PBcR with Illumina or other high-accuracy data remains unchanged from CA 8.1 and is no longer being updated.
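
As a rough illustration of the coverage thresholds described above, the following Python sketch estimates coverage from the total number of read bases and the genome size, then reports which overlapper behaviour these notes describe for that coverage. The function names and the returned labels are hypothetical and are not PBcR option names. Treating exactly 50X as "fast" is an assumption, since the notes only say "below 50X" and "above".

```python
def estimate_coverage(total_read_bases: int, genome_size: int) -> float:
    """Return sequencing depth: total read bases divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_read_bases / genome_size


def describe_overlap_mode(coverage: float) -> str:
    """Map a coverage estimate to the behaviour described in these notes.

    The 30X and 50X thresholds come from the text above. The labels are
    hypothetical and only describe the behaviour; they are not real options.
    """
    if coverage < 30:
        # Low-coverage parameters are not on by default; the user enables them.
        return "sensitive (low-coverage parameters must be enabled by the user)"
    if coverage < 50:
        return "sensitive"
    return "fast"  # assumption: exactly 50X is treated like "above 50X"


if __name__ == "__main__":
    genome = 12_157_105  # S. cerevisiae W303 size quoted in these notes
    for depth in (20, 40, 80):
        cov = estimate_coverage(genome * depth, genome)
        print(f"{cov:.0f}X -> {describe_overlap_mode(cov)}")
```

The sketch only mirrors the prose; the actual switch happens inside the PBcR pipeline.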

MHAP 1.5

MHAP 1.5 reduces the memory usage for the index and accelerates the second-stage filter by using random subsampling. It also eliminates repeat k-mer filtering and instead uses tf-idf weighting to down-weight repetitive k-mers. However, overlaps composed of only repetitive k-mers will still be found.
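
To make the idea of down-weighting concrete, here is a minimal Python sketch of inverse-document-frequency weighting applied to k-mers, combined with a weighted Jaccard similarity. It is not MHAP's implementation: the function names, the smoothed formula `1 + ln(N / df)`, and the toy reads are all assumptions chosen for illustration. It shows only that a k-mer present in every read gets the smallest weight rather than being discarded, so two reads sharing only repetitive k-mers still score above zero.

```python
import math
from collections import Counter


def kmers(seq: str, k: int) -> list[str]:
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def idf_weights(reads: list[str], k: int) -> dict[str, float]:
    """Weight each k-mer by 1 + ln(N / df), where df counts reads containing it.

    A k-mer found in every read gets weight 1.0 (the minimum), never 0.
    """
    n = len(reads)
    df = Counter()
    for read in reads:
        df.update(set(kmers(read, k)))  # count each k-mer once per read
    return {km: 1.0 + math.log(n / count) for km, count in df.items()}


def weighted_jaccard(a: str, b: str, weights: dict[str, float], k: int) -> float:
    """Weighted Jaccard similarity over k-mer counts of two reads."""
    ta, tb = Counter(kmers(a, k)), Counter(kmers(b, k))
    num = den = 0.0
    for km in ta.keys() | tb.keys():
        w = weights.get(km, 1.0)  # unseen k-mers fall back to the minimum weight
        num += w * min(ta[km], tb[km])
        den += w * max(ta[km], tb[km])
    return num / den if den else 0.0


if __name__ == "__main__":
    reads = ["ACGTACGT", "ACGTTTTT", "ACGTGGGG"]  # toy reads; ACGT acts as a repeat
    w = idf_weights(reads, 4)
    print(w["ACGT"], w["CGTA"])
    print(weighted_jaccard(reads[1], reads[2], w, 4))
```

In this toy input the second and third reads share only the repeat k-mer ACGT, and their similarity is small but non-zero.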

Assembly Comparison

Below, we compare assembly results using CA 8.2, CA 8.3, and Falcon 0.2.1. Falcon relies on DALIGNER to find overlaps.

Comparison of CA 8.2, 8.3, and Falcon assembly results. The genome sizes were estimated from the reference to be 12,157,105 for S. cerevisiae W303; 119,482,035 for A. thaliana Ler-0; 129,663,327 for D. melanogaster ISO1; and 3,101,804,741 for human. For consistency with the MHAP publication, CA 8.2 results exclude contigs with fewer than 50 reads. For CA 8.3 and Falcon, all contigs are included in the statistics. CA 8.2 ran on a faster machine than CA 8.3 and Falcon, so the speedup for 8.3 is greater than the timings show. For example, on the same machine as CA 8.2, D. melanogaster takes only 180 h for overlapping and 368 h for assembly (548 h total). All tests except CHM1 and CHM13 ran on a single machine with an NFS filesystem. CHM1 and CHM13 ran on a cluster with a shared NFS filesystem.

Olap h: CPU hours to compute initial overlaps using DALIGNER or MHAP.
Contig h: CPU hours to correct reads and assemble contigs after initial overlapping.

* The Falcon results use the assembly size of 2.85Gbp to compute N50, not the 3.1Gbp genome size, so the CA 8.2 and 8.3 results are computed using the same genome size for consistency. If the genome size of 3.1Gbp were used instead, the NG50 would be 4,325,099 and 6,063,688 for CA 8.2 and 8.3, respectively.
** The DNAnexus results use the assembly size of 2.81Gbp to compute N50, not the 3.1Gbp genome size, so the CA 8.3 results are computed using the same genome size for consistency. If the genome size of 3.1Gbp were used instead, the NG50 for CA 8.3 would be 12,478,805.
*** The CPU time is reported by DNAnexus, but no system details are provided, so it is unknown whether the CPU is comparable to the one used for CA 8.3.

Genome             Assembler         Olap h  Total h      # Ctg       Total bp  Max Contig        NG50
S. cerevisiae      CA 8.2                20       28         21     12,255,862   1,547,184     818,260
                   CA 8.3                13       24         36     12,538,158   1,531,780     818,126
                   Falcon                 6       14         60     12,303,083   1,515,980     732,525
                   DALIGN/CA 8.3          6       17         44     12,783,793   1,545,148     818,599
A. thaliana        CA 8.2             1,700    1,900         38    120,486,579  15,874,695  11,164,124
                   CA 8.3               450    1,243        786    136,610,432  15,700,016   9,587,434
                   Falcon               606      796        320    122,322,212  15,916,027   7,266,254
                   DALIGN/CA 8.3        606    1,034      1,036    143,307,976  14,935,229   9,319,878
D. melanogaster    CA 8.2               890    1,060        132    143,328,915  25,750,791  20,985,587
                   CA 8.3               230      713        977    167,426,208  27,033,832  21,688,913
                   Falcon               342      677        400    154,356,103  23,477,403   8,102,012
                   DALIGN/CA 8.3        342      700      1,717    189,706,560  22,286,119  15,145,131
H. sapiens CHM1    CA 8.2           220,000  260,000      3,434  2,828,300,545  30,045,963   5,099,387*
                   CA 8.3            19,700   61,200     17,776  3,041,936,246  35,487,175   7,154,352*
                   Falcon               N/A      N/A      5,528  2,818,296,359         N/A   5,460,023
H. sapiens CHM13   CA 8.3             8,171   68,602     15,538  3,061,261,250  81,522,549  14,591,835**
                   Falcon DNAnexus      N/A   43,500***   2,203  2,809,672,639  53,079,926  11,090,487
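
The notes accompanying the table distinguish N50 computed against the assembly size from NG50 computed against the genome size. The Python sketch below shows that difference using the usual definitions of the two statistics. It is an illustration only, not the code used to produce the table, and the contig lengths in the example are invented toy values.

```python
def ng50(contig_lengths: list[int], genome_size: int) -> int:
    """Length of the contig at which the running total (largest contigs first)
    reaches half of genome_size. Returns 0 if the assembly never gets there."""
    target = genome_size / 2
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= target:
            return length
    return 0


def n50(contig_lengths: list[int]) -> int:
    """N50 is NG50 with the assembly's own total length as the reference size."""
    return ng50(contig_lengths, sum(contig_lengths))


if __name__ == "__main__":
    contigs = [50, 40, 30, 20, 10]   # toy lengths, total 150
    print(n50(contigs))              # reference size = assembly size (150)
    print(ng50(contigs, 200))        # reference size = a larger genome size
```

Because the reference size sets the halfway target, using a genome size larger than the assembly size can only lower the reported value or leave it unchanged, which is why the footnotes give both figures.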

The MHAP overlapping is faster than DALIGNER for larger genomes; however, the overall runtime of CA 8.3 is slower because of the overhead of Celera Assembler data structures. In all cases, the assemblies produced by CA 8.3 are more contiguous than those from Falcon, especially in repeat regions. For example, on S. cerevisiae the CA 8.3 assembly has 10 contigs which include both telomeres, while the Falcon/DALIGNER assembly has only 1. CA 8.3 also outperforms CA 8.2 in both contiguity and speed. As predicted by the MHAP publication, the overlapping is faster for longer sequences: CHM13, sequenced with 27M reads with P6 chemistry, takes 50% of the runtime of CHM1, sequenced with 22M reads with P5 chemistry, while using the same parameters.

We assembled Falcon-corrected sequences with Celera Assembler (DALIGN/CA 8.3 in the table) to isolate the effects of the overlapper. Both CA 8.3 and Falcon use the same consensus module with identical parameters. The assembly with Celera Assembler was performed with identical parameters. Thus, the difference between these assemblies is MHAP versus DALIGNER overlaps (not the consensus or assembly algorithm). In all cases, the assembly with Celera Assembler improves the N50 of the assembly, but the MHAP assemblies are more contiguous than both the pure Falcon assemblies and DALIGN/CA 8.3.

Backward Compatibility

Celera Assembler 8.3 is not compatible with Celera Assembler 8.2. Its intermediate files are generally incompatible with earlier versions of CA. Users should not run 8.3 software against earlier pipeline files, or earlier software on 8.3 pipeline files. Users should launch CA 8.3 assemblies from scratch.

Bug Fixes

  • Fix bug that would undertrim or delete reads when the first overlap for the read was high error.
  • Fix truncation of long sequences. Increase the maximum line length to 16 million in gkpstore. Reads longer than the assembler maximum (256k) will be truncated to 256k.
  • Fix bug where PBcR would always use /usr/bin/perl; it now relies on /usr/bin/env instead.
  • Fix bug that made 1-overlapper crash when an outdated Java is available. The Java version error is now reported to the user instead.
  • Fix bug that would cause array jobs to fail on PBS.
  • Fix bug that would cause errors when submitting jobs to PBS.

Known Problems

  • PBcR hybrid correction is slow
  1. PBcR using Illumina or other data for correction is significantly slower than self-correction (PacBio data only, with at least 25X coverage). This is because of its reliance on BLASR or a built-in module for overlapping. The self-correction mode of PBcR is recommended.
  • Scaffolder (CGW) is slow
  1. Some data sets are exhibiting enormous run times in the scaffolding module.
  • Invalid Results
  1. In sffToCA, paired-end read lengths are calculated incorrectly in rare cases when '-trim hard' is used. Until the issue is resolved, users should always use '-trim chop' on 454 paired-end libraries.
  • Scaling
  1. Creating an overlap store for billions of fragments is a significant bottleneck. This process can take multiple days, sometimes longer than the (parallel) computation of the overlaps.
  • Algorithmic limitations
  1. There is no explicit support for high coverage. Coverage such as 80X induces combinations of sequencing errors that confound Celera Assembler. At best this leads to higher reported rates of allelic variation. At worst this leads to a fractured assembly. Sampling from high-coverage reads can yield a better assembly.
  2. There is little support for assembly of data sets with a small ratio of mate pairs to unpaired fragments. Mate pairs are used to detect misassemblies and to form contigs and scaffolds. Too few mate pairs can result in shattered assemblies and many bases in degenerate contigs.
  3. There is no support for bar-coded 454 data. Users with bar-coded data may use some other utility to remove the bar code sequence and partition reads by bar code into separate SFF files.
  4. There is no support for data from ABI SOLiD.
  5. There is no support for cDNA, exon-enriched DNA, or DNA amplified with bias of any sort.
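
The suggestion above to sample from high-coverage reads can be sketched in a few lines of Python. This is a standalone illustration, not a Celera Assembler utility: the function names are hypothetical, the 40X target in the example is an arbitrary value, and the sketch treats reads as independent, so mate pairs would need to be sampled together, which it does not handle.

```python
import random


def sampling_fraction(current_coverage: float, target_coverage: float) -> float:
    """Fraction of reads to keep so that expected coverage is about the target."""
    if current_coverage <= 0 or target_coverage <= 0:
        raise ValueError("coverages must be positive")
    return min(1.0, target_coverage / current_coverage)


def subsample(read_ids, fraction: float, seed: int = 0) -> list:
    """Keep each read independently with probability `fraction`.

    A fixed seed makes the selection reproducible between runs.
    """
    rng = random.Random(seed)
    return [rid for rid in read_ids if rng.random() < fraction]


if __name__ == "__main__":
    frac = sampling_fraction(current_coverage=80, target_coverage=40)
    kept = subsample(range(1000), frac)
    print(frac, len(kept))
```

Independent sampling gives the target coverage only in expectation; the exact number of reads kept varies with the seed.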

Legal

Copyright 1999-2004 by Applera Corporation. Copyright 2005-2013 by the J. Craig Venter Institute. The Celera Assembler software, also known as the wgs-assembler and CABOG, is open-source and available free of charge subject to the GNU General Public License, version 2.