Version 6.1 Release Notes


These are Release Notes for Celera Assembler version 6.1, released May 1st, 2010.

This distribution package provides a stable, tested, documented version of the software. The distribution is usable on most Unix-like platforms, and some platforms have pre-compiled binary distributions ready for installation.

The source code package includes full source code, Makefiles, and scripts. A subset of the kmer package (http://kmer.sourceforge.net/, version r1821), used by some modules of Celera Assembler, is included.

This package was prepared by scientists at the J. Craig Venter Institute (http://www.jcvi.org/) with funding provided by the National Institutes of Health (http://www.nih.gov/).

Full documentation can be found online at http://wgs-assembler.sourceforge.net/.

Citation

Please cite Celera Assembler in publications that refer to its algorithm or its output. The standard citation is the original paper [Myers et al. (2000) A Whole-Genome Assembly of Drosophila. Science 287 2196-2204]. More recent papers describe modifications for human genome assembly [Istrail et al. 2004; Levy et al. 2007], metagenomics assembly [Venter et al. 2004; Rusch et al. 2007], haplotype separation [Levy et al. 2007; Denisov et al. 2008], a Sanger+pyrosequencing hybrid pipeline [Goldberg et al. 2006] and native assembly of 454 data [Miller et al. 2008]. There are links to these papers, and more, in the on-line documentation (http://wgs-assembler.sourceforge.net/).

Compilation and Installation

Users can download Celera Assembler as source code or as pre-compiled binaries. The source code package needs to be compiled and installed before it can be used. The binary distributions need only be unpacked, but they are not available for all platforms.

To build from source, execute these commands on any Unix-like platform:

% bzip2 -dc wgs-6.1.tar.bz2 | tar -xf -
% cd wgs-6.1
% cd kmer
% sh configure.sh
% gmake install
% cd ../src
% gmake
% cd ..
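
Before starting the build, it can help to confirm that the tools named in the commands above are on the PATH. The following is a minimal sketch, not part of the distribution; the function name missing_tools is hypothetical. It prints each tool that cannot be found and prints nothing when all are present.

```shell
# Hypothetical helper, not shipped with Celera Assembler.
# Prints the name of every listed tool that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Tool names are taken from the build commands above.
missing_tools bzip2 tar gmake sh
```

If gmake is reported missing, GNU make may still be installed under the name make, as it is on many Linux distributions.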

To use the binary distributions, choose a platform, download that package, then unpack it with a command like this:

% bzip2 -dc wgs_6.1_MyPlatform.tar.bz2 | tar -xf -

Usage: an example

Here is one example of how to use Celera Assembler. We download DNA sequence from the NCBI TraceDB, hosted by the National Institutes of Health. We request paired-end fragment reads generated by Sanger chemistry on a bacterial genome. We convert the data to Celera Assembler format (a *.frg file). Then we launch the assembler software.

Please see the expanded version of this example for more details.

% ftp ftp.ncbi.nih.gov
ftp> cd pub/TraceDB/porphyromonas_gingivalis_w83
[...]
ftp> ls -l
229 Entering Extended Passive Mode (|||50055|)
150 Opening ASCII mode data connection for file list
-r--r--r--   1 ftp      anonymous   803288 Feb 19 17:29 anc.porphyromonas_gingivalis_w83.001.gz
-r--r--r--   1 ftp      anonymous   226576 Feb 19 17:29 clip.porphyromonas_gingivalis_w83.001.gz
-r--r--r--   1 ftp      anonymous 10261474 Feb 19 17:29 fasta.porphyromonas_gingivalis_w83.001.gz
-r--r--r--   1 ftp      anonymous 18039942 Feb 19 17:29 qual.porphyromonas_gingivalis_w83.001.gz
-r--r--r--   1 ftp      anonymous  1143755 Feb 27 08:01 xml.porphyromonas_gingivalis_w83.001.gz
226 Transfer complete.
ftp> bin
ftp> prompt
ftp> mget fasta* qual* xml*
ftp> bye

Convert the NCBI files to files suitable as Celera Assembler input. The last tracedb-to-frg step (-frg) takes about a minute. The 'ls' command shows the expected output files.

% perl wgs/Linux-amd64/bin/tracedb-to-frg.pl -xml xml.porphyromonas_gingivalis_w83.001.gz
% perl wgs/Linux-amd64/bin/tracedb-to-frg.pl -lib xml.porphyromonas_gingivalis_w83.001.gz
WARNING: creating bogus distance for library 'T0611'
WARNING: creating bogus distance for library 'T0612'
WARNING: creating bogus distance for library 'T13146'
WARNING: creating bogus distance for library 'T24315'
porphyromonas_gingivalis_w83.001: frags=40039 links=17113 (85.48%)
% perl wgs/Linux-amd64/bin/tracedb-to-frg.pl -frg xml.porphyromonas_gingivalis_w83.001.gz
Found 20020 in tafrg-porphyromonas_gingivalis_w83/porphyromonas_gingivalis_w83.001.frglib.
% ls -l *.frg*
-rw-r--r--  1 bri  bri       550 Jul 15 21:54 porphyromonas_gingivalis_w83.1.lib.frg
-rw-r--r--  1 bri  bri  19119148 Jul 15 21:55 porphyromonas_gingivalis_w83.2.001.frg.bz2
-rw-r--r--  1 bri  bri    704432 Jul 15 21:54 porphyromonas_gingivalis_w83.3.lkg.frg

Run Celera Assembler with runCA.

% time perl wgs/Linux-amd64/bin/runCA -p pging -d testassembly porphyromonas_gingivalis_w83.*
[...lots of status output...]
[...about four minutes later...]
% ls -l testassembly/9-terminator/*fasta
-rw-r--r--  1 bri  bri  2373611 Jul 15 22:07 testassembly/9-terminator/pging.ctg.fasta
-rw-r--r--  1 bri  bri    39177 Jul 15 22:07 testassembly/9-terminator/pging.deg.fasta
-rw-r--r--  1 bri  bri  2392028 Jul 15 22:07 testassembly/9-terminator/pging.scf.fasta
-rw-r--r--  1 bri  bri   445876 Jul 15 22:07 testassembly/9-terminator/pging.singleton.fasta
-rw-r--r--  1 bri  bri  2692602 Jul 15 22:07 testassembly/9-terminator/pging.utg.fasta
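
As a quick sanity check on the output files listed above, one can count the sequences and bases in each FASTA file. This is a minimal sketch that assumes plain FASTA text with header lines starting with '>'. The function name fasta_stats is hypothetical and not part of Celera Assembler.

```shell
# Hypothetical helper, not shipped with Celera Assembler.
# Counts header lines and sequence characters in FASTA input
# (files given as arguments, or standard input if none are given).
fasta_stats() {
    awk '
    /^>/ { nseq++; next }                          # header line: one more sequence
    { gsub(/[ \t\r]/, ""); nbases += length($0) }  # sequence line: count its characters
    END { print nseq + 0, "sequences,", nbases + 0, "bases" }
    ' "$@"
}
```

For example, fasta_stats testassembly/9-terminator/pging.scf.fasta reports the scaffold count and total scaffold length for the assembly above.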

Changes in CA 6.1

Backward Compatibility

Celera Assembler 6.1 uses new file formats. Its intermediate files are generally incompatible with earlier versions of CA. Users should not run 6.1 software against earlier pipeline files, or earlier software on 6.1 pipeline files. Users should launch CA 6.1 assemblies from scratch.

New Features

  1. Illumina Data: CA 6.1 introduces native support for DNA sequencing reads from the Illumina (or Solexa) platform. Celera Assembler reads three varieties of fastq files. It requires an FRG meta-data file that can be generated with the fastqToCA utility, new in CA 6.1. CA 6.1 has been validated using 100x coverage of Illumina GA II 100bp reads, paired at 200bp insert size, from Escherichia coli. The quality of support for shorter reads or larger genomes is still unknown. The minimum read and overlap lengths remain at 64bp and 40bp, respectively. (These can be adjusted at compile time.) Celera Assembler has almost no code specific to the sequencing platform. Therefore, Celera Assembler can assemble mixtures of read types including Sanger, 454, and Illumina.
  2. More Reads: Celera Assembler 4 could comfortably handle tens of millions of (Sanger) reads. Celera Assembler 5 could comfortably handle hundreds of millions of (454) reads. Celera Assembler 6 should be able to handle up to a billion reads. (This claim has not yet been verified by the regulatory agencies.)
  3. Duplicate 454 Reads: Previous versions of CA removed 454 reads that were a prefix of some other 454 read, but only if the prefixes were identical. CA 6.1 allows errors in the prefix match. Additionally, it detects and removes redundant mate-pairs apparently sequenced from the same molecule.
  4. Consensus: CA 6.1 experiences many fewer consensus failures than Celera Assembler 5. The consensus code is more reliable due to extensive refactoring, problem discovery, and improvement. A new, last-ditch consensus algorithm is applied in the cases where all others fail.
  5. Phasing: By user request, there is a new option to control the phasing of variants. With phasing off, the majority allele at every variant column is promoted to the consensus. With phasing on, the choice of majority allele could reflect a group of neighbor variant columns. As a result, the minority base at some columns could get promoted to the consensus. Phasing was deemed unacceptable for some applications. Phasing was always enabled in Celera Assembler 5; it is disabled by default in Celera Assembler 6. See the cnsPhasing option.
  6. Closure/Finishing: CA 6.1 offers additional support for finishing. This refers to improving the quality of an assembly by analysis, manipulation, targeted sequencing, and re-assembly. The pipeline now accepts closure reads and their placement constraints. Specifically, the input FRG format has a new PLC placement message type. The assembler considers these constraints, not as absolutes, but on par with other information. It uses them, not only to close gaps, but throughout the pipeline. The constraints could be derived from PCR primer sites and apply to the corresponding PCR reads, for example.
  7. Pipeline Engineering: The traditional Celera Assembler used its own message passing interface wherein information passed from module to module in large (text or binary) files. Over the years, Celera Assembler has shifted to the use of its own database structure, called Store. Previous versions of Celera Assembler introduced the read database (gkpStore) and the overlap database (ovlStore). Celera Assembler 6.1 introduces a database for unitigs and contigs (tigStore) to replace the *.cgb message files. This change reduces I/O. It also facilitates debugging and mid-course alterations to problematic assemblies. There are utilities for dumping, querying, and updating the database.
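
To make the Phasing item above concrete, the following toy sketch shows what "phasing off" means: the most frequent base in each column is chosen independently of every other column. It is an illustration only, not the consensus module's code. The input format (one gapped read per line, all of equal length) and the function name majority_consensus are assumptions made for this example.

```shell
# Toy illustration only; this is not Celera Assembler code.
# Input: one aligned, gapped read per line, all the same length.
# Output: the most frequent symbol in each column, chosen independently.
majority_consensus() {
    awk '
    {
        if (length($0) > width) width = length($0)
        for (i = 1; i <= length($0); i++)
            count[i, substr($0, i, 1)]++
    }
    END {
        alphabet = "ACGT-"          # fixed scan order; ties go to the earlier symbol
        out = ""
        for (i = 1; i <= width; i++) {
            best = "N"; bestn = 0
            for (j = 1; j <= length(alphabet); j++) {
                b = substr(alphabet, j, 1)
                if (count[i, b] > bestn) { bestn = count[i, b]; best = b }
            }
            out = out best
        }
        print out
    }'
}

printf 'ACGT\nACTT\nAAGT\n' | majority_consensus
```

With phasing on, the choice at one column could instead depend on neighboring variant columns, so a per-column minority base could appear in the consensus. This sketch does not model that.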

Improvements

  1. Unitig Toggling: The unitig toggling process is supported directly from runCA. The algorithm now recognizes and toggles an additional case: a false-repeat unitig preventing the join of two scaffolds by being placed on the end of one scaffold and the beginning of the other.
  2. Scaffolding: The runCA executive has new options that may improve scaffolding. One ejects a contig from a scaffold when it no longer has overlaps to its neighbors. Another shrinks a gap to its original size if it had been enlarged to fit unitigs (rocks & stones) but the unitigs didn't fit.
  3. Long 454 Reads: Very long bad reads, sometimes generated on 454 platforms, are correctly trimmed.
  4. Disk Usage: The intermediate output files from the overlap stage are now removed after the corresponding overlap store is created.
  5. The runCA executive lost several options:
    • bogEjectUnhappyContain was removed. It is always enabled now.
    • cgwOutputIntermediate was removed. The need for this option was removed by allowing terminator to read input directly from the tigStore and a CGW checkpoint file.
  6. Several existing options were renamed:
  7. The runCA executive has several new options:

Bug Fixes

The major changes between 5.4 and 6.1 are on this page under bug fixes or improvements. See the full list of changes, which includes bugs that were introduced and then fixed during the upgrade from 5.4 to 6.1.

  1. Scaffolding: The CGW scaffold module has a major bug fix. This fix changes the number of scaffolds generated on some data. Users may be disappointed to see MORE scaffolds in a 6.1 assembly of the same data, compared to a 5.4 assembly. The bug was present in CA 5.4 and possibly all earlier versions. The bug was in code that tested scaffold merges before implementing them. The buggy code returned true unconditionally. This encouraged CA to merge scaffolds even if the merger would leave more unsatisfied mate constraints. Other checks in the code (e.g. sequence alignments) seem to have prevented reckless chimerism. After the fix, the code at first rejected many scaffold merges that could be confirmed by comparison to a reference. Therefore, we re-tuned the fixed code so it would approve more merges. Tuning involved redefining which mate constraints should be counted toward and against a given merge. In tests, the new (fixed and tuned) code did break scaffolds where the old code merged. Some breaks fixed chimeras but others seemed unnecessary. Although the new (fixed and tuned) code may generate more scaffolds, it does improve assemblies by these quality metrics: more satisfied mate constraints and fewer unsatisfied mate constraints.
  2. Selection of Unitigger: runCA automatically selects a unitig module (UTG or BOG) based on the input fragment types. Previously, in some cases, it was not possible to override this selection. Now users can control the unitig module selection.
  3. Counting of Good Mates: In the QC report, reads placed in a surrogate unitig are now correctly counted as "Good Mate" if the mate pair is satisfied. Previously, these reads were always counted as "Reads with Surrogate Mate".

Known Problems

These may be fixed in future releases.

  1. In sffToCA, paired-end read lengths are calculated incorrectly in rare cases when '-trim hard' is used. Until the issue is resolved, users should always use '-trim chop' on 454 paired-end libraries.
  2. There is no explicit support for the high coverage produced by XLR runs on small genomes. Coverage such as 80X induces combinations of sequencing errors that confound Celera Assembler. At best this leads to higher reported rates of allelic variation. At worst this leads to a fractured assembly. Sampling from high-coverage reads can yield a better assembly.
  3. There is little support for assembly of data sets with a small ratio of mate pairs to unpaired fragments. This ratio happens when XLR is applied to many fragment libraries but few paired-end libraries. This ratio impedes the assembly of unitigs into contigs and scaffolds. It leads to fewer bases in scaffolds and more bases in degenerate contigs.
  4. There is no support for bar-coded 454 data. Users with bar-coded data may use some other utility to remove the bar code sequence and partition reads by bar code into separate SFF files.
  5. There is no support for data from ABI SOLiD.
  6. There is no support for cDNA, exon-enriched DNA, or DNA amplified with bias of any sort.
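
For the high-coverage problem in item 2, the arithmetic behind sampling can be sketched as follows. Coverage is estimated as read count times mean read length divided by genome size, and the fraction of reads to keep is the target coverage divided by the current coverage. The function names are hypothetical, and the target of 20X used in the example is an arbitrary illustration, not a recommendation from the developers.

```shell
# Hypothetical helpers, not shipped with Celera Assembler.

# estimate_coverage READS MEAN_READ_LENGTH GENOME_SIZE
estimate_coverage() {
    awk -v n="$1" -v len="$2" -v g="$3" 'BEGIN {
        if (g <= 0) { print "error: genome size must be positive" > "/dev/stderr"; exit 1 }
        printf "%.1f\n", n * len / g
    }'
}

# sampling_fraction CURRENT_COVERAGE TARGET_COVERAGE
sampling_fraction() {
    awk -v cur="$1" -v tgt="$2" 'BEGIN {
        if (cur <= 0) { print "error: coverage must be positive" > "/dev/stderr"; exit 1 }
        f = tgt / cur
        if (f > 1) f = 1            # never ask for more reads than exist
        printf "%.3f\n", f
    }'
}

# Example: reduce 80X to an arbitrary 20X target.
sampling_fraction 80 20
```

When sampling mated reads, both reads of a pair should be kept or dropped together so that mate information is preserved.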

Legal

Copyright 1999-2004 by Applera Corporation. Copyright 2005-2009 by the J. Craig Venter Institute. The Celera Assembler software, also known as the wgs-assembler and CABOG, is open-source and available free of charge subject to the GNU General Public License, version 2.