Version 7.0 Release Notes
These are release notes for Celera Assembler version 7.0, which was released on January 15th, 2012.
This distribution package provides a stable, tested, documented version of the software. The distribution is usable on most Unix-like platforms, and some platforms have pre-compiled binary distributions ready for installation.
The source code package includes full source code, Makefiles, and scripts. A subset of the kmer package (http://kmer.sourceforge.net/, version r1916), used by some modules of Celera Assembler, is included.
Full documentation can be found online at http://wgs-assembler.sourceforge.net/.
Please cite Celera Assembler in publications that refer to its algorithm or its output. The standard citation is the original paper [Myers et al. (2000) A Whole-Genome Assembly of Drosophila. Science 287 2196-2204]. More recent papers describe modifications for human genome assembly [Istrail et al. 2004; Levy et al. 2007], metagenomics assembly [Venter et al. 2004; Rusch et al. 2007], haplotype separation [Levy et al. 2007; Denisov et al. 2008], a Sanger+pyrosequencing hybrid pipeline [Goldberg et al. 2006] and native assembly of 454 data [Miller et al. 2008]. There are links to these papers, and more, in the on-line documentation (http://wgs-assembler.sourceforge.net/).
Compilation and Installation
Users can download Celera Assembler as source code or as pre-compiled binaries. The source code package needs to be compiled and installed before it can be used. The binary distributions need only be unpacked, but they are not available for all platforms.
To use the source code, execute these commands on any Unix-like platform:
% bzip2 -dc wgs-7.0.tar.bz2 | tar -xf -
% cd wgs-7.0
% cd kmer
% gmake install
% cd ../src
% gmake
% cd ..
To use the binary distributions, choose a platform, download that package, then unpack it with a Unix command like this:
% bzip2 -dc wgs_7.0_MyPlatform.tar.bz2 | tar -xf -
Usage: an example
Here is one example of how to use Celera Assembler. We download DNA sequence from the NCBI TraceDB hosted by the National Institutes of Health. We request paired-end fragment reads generated by Sanger chemistry on a bacterial genome. We convert the data to Celera Assembler format (a *.frg file). Then we launch the assembler software.
% ftp ftp.ncbi.nih.gov
ftp> cd pub/TraceDB/porphyromonas_gingivalis_w83
[...]
ftp> ls -l
229 Entering Extended Passive Mode (|||50055|)
150 Opening ASCII mode data connection for file list
-r--r--r-- 1 ftp anonymous 803288 Feb 19 17:29 anc.porphyromonas_gingivalis_w83.001.gz
-r--r--r-- 1 ftp anonymous 226576 Feb 19 17:29 clip.porphyromonas_gingivalis_w83.001.gz
-r--r--r-- 1 ftp anonymous 10261474 Feb 19 17:29 fasta.porphyromonas_gingivalis_w83.001.gz
-r--r--r-- 1 ftp anonymous 18039942 Feb 19 17:29 qual.porphyromonas_gingivalis_w83.001.gz
-r--r--r-- 1 ftp anonymous 1143755 Feb 27 08:01 xml.porphyromonas_gingivalis_w83.001.gz
226 Transfer complete.
ftp> bin
ftp> prompt
ftp> mget fasta* qual* xml*
ftp> bye
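The interactive session above can also be scripted. The following is a minimal sketch using Python's standard ftplib module, assuming the host and directory shown in the session are still reachable; the function names select_files and download_traces are hypothetical helpers introduced here for illustration and are not part of Celera Assembler.

```python
from fnmatch import fnmatch
from ftplib import FTP

# Values copied from the FTP session above; the server layout may have changed.
HOST = "ftp.ncbi.nih.gov"
REMOTE_DIR = "pub/TraceDB/porphyromonas_gingivalis_w83"
PATTERNS = ("fasta*", "qual*", "xml*")  # same globs as the 'mget' command


def select_files(names, patterns=PATTERNS):
    """Return the names matching any glob pattern, preserving input order."""
    return [n for n in names if any(fnmatch(n, p) for p in patterns)]


def download_traces(host=HOST, remote_dir=REMOTE_DIR, patterns=PATTERNS):
    """Fetch the matching files into the current directory (anonymous login)."""
    with FTP(host) as ftp:
        ftp.login()  # no arguments means an anonymous login
        ftp.cwd(remote_dir)
        wanted = select_files(ftp.nlst(), patterns)
        for name in wanted:
            with open(name, "wb") as out:
                # retrbinary transfers in binary mode, like 'bin' in the session
                ftp.retrbinary("RETR " + name, out.write)
    return wanted
```

Calling download_traces() performs the network transfer; select_files() can be tried on its own with the file names from the directory listing.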
Convert the NCBI files into Celera Assembler input files. The last tracedb-to-frg.pl step takes about a minute. The 'ls' command shows the expected output files.
% perl wgs/Linux-amd64/bin/tracedb-to-frg.pl -xml xml.porphyromonas_gingivalis_w83.001.gz
% perl wgs/Linux-amd64/bin/tracedb-to-frg.pl -lib xml.porphyromonas_gingivalis_w83.001.gz
WARNING: creating bogus distance for library 'T0611'
WARNING: creating bogus distance for library 'T0612'
WARNING: creating bogus distance for library 'T13146'
WARNING: creating bogus distance for library 'T24315'
porphyromonas_gingivalis_w83.001: frags=40039 links=17113 (85.48%)
% perl wgs/Linux-amd64/bin/tracedb-to-frg.pl -frg xml.porphyromonas_gingivalis_w83.001.gz
Found 20020 in tafrg-porphyromonas_gingivalis_w83/porphyromonas_gingivalis_w83.001.frglib.
% ls -l *.frg*
-rw-r--r-- 1 bri bri 550 Jul 15 21:54 porphyromonas_gingivalis_w83.1.lib.frg
-rw-r--r-- 1 bri bri 19119148 Jul 15 21:55 porphyromonas_gingivalis_w83.2.001.frg.bz2
-rw-r--r-- 1 bri bri 704432 Jul 15 21:54 porphyromonas_gingivalis_w83.3.lkg.frg
Run Celera Assembler with runCA.
% time perl wgs/Linux-amd64/bin/runCA -p pging -d testassembly porphyromonas_gingivalis_w83.*
[...lots of status output...]
[...about four minutes later...]
% ls -l testassembly/9-terminator/*fasta
-rw-r--r-- 1 bri bri 2373611 Jul 15 22:07 testassembly/9-terminator/pging.ctg.fasta
-rw-r--r-- 1 bri bri 39177 Jul 15 22:07 testassembly/9-terminator/pging.deg.fasta
-rw-r--r-- 1 bri bri 2392028 Jul 15 22:07 testassembly/9-terminator/pging.scf.fasta
-rw-r--r-- 1 bri bri 445876 Jul 15 22:07 testassembly/9-terminator/pging.singleton.fasta
-rw-r--r-- 1 bri bri 2692602 Jul 15 22:07 testassembly/9-terminator/pging.utg.fasta
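Once the FASTA files exist, a quick sanity check is to count the sequences and bases in each one. The sketch below is not part of Celera Assembler; it is a small stand-alone Python helper (the names read_fasta_lengths, n50 and summarize are hypothetical) that reports sequence count, total bases and N50 for any FASTA file, such as the scaffold file listed above.

```python
def read_fasta_lengths(lines):
    """Yield (name, length) for each record in an iterable of FASTA lines."""
    name, length = None, 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, length
            fields = line[1:].split()
            name, length = (fields[0] if fields else ""), 0
        else:
            length += len(line)
    if name is not None:
        yield name, length


def n50(lengths):
    """Length L such that sequences of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # no sequences


def summarize(lines):
    """Return sequence count, total bases and N50 for FASTA lines."""
    lengths = [length for _, length in read_fasta_lengths(lines)]
    return {"sequences": len(lengths), "bases": sum(lengths), "n50": n50(lengths)}
```

To use it on the assembly above, open testassembly/9-terminator/pging.scf.fasta and pass the file object to summarize().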
Changes in CA 7.0
The major changes between 6.1 and 7.0 are on this page under bug fixes or improvements. See the full list of changes, which includes bugs that were introduced and then fixed during the upgrade from 6.1 to 7.0.
Celera Assembler 7.0 uses new file formats. Its intermediate files are generally incompatible with earlier versions of CA. Users should not run 7.0 software against earlier pipeline files, or earlier software on 7.0 pipeline files. Users should launch CA 7.0 assemblies from scratch.
- bogart: A new unitig algorithm based on the Best Overlap Graph used in the BOG unitigger. The bogart algorithm uses read overlaps to detect repeat/unique junctions, and will break all junctions that are ambiguous.
- de-novo Classification: A method for distinguishing PE-pairs from MP-pairs in Illumina mate pair (jumping library) data was added to Celera Assembler; it runs before the unitig stage. For details on the algorithm, see the poster 'Pair classification within Illumina mate pair data', presented at Cold Spring Harbor Genome Informatics, 2011.
- Unitig Consensus: The unitig consensus algorithm changed significantly to overcome three insidious errors in the fragment multialignment and resulting consensus sequence. To fix these errors, we needed to switch from a fast k-mer and bit-mask based alignment algorithm to a slow dynamic programming alignment algorithm. This greatly increased the time to compute unitig consensus. See the page on Unitig Consensus in Version 7.0 for details.
- Unitig Consensus: A new module will automatically fix unitigs that fail to generate a consensus sequence.
- Unitigs: A new program to compute the coverage statistic of each unitig in the store has been added. Previously, this statistic was computed at the end of the unitigging module. This statistic is impacted by the runCA parameter 'utgGenomeSize'. To change the parameter after unitigs were generated meant that both unitigger and consensus would need to be rerun. Now, the coverage statistic can be recomputed independently.
- Unitig Splitting: Move the mate-based chimeric unitig detection algorithm out of CGW into its own module. This resolves crashes in CGW during unitig splitting (in function 'StoreIUMStruct()').
- FASTQ Support: Celera Assembler is moving towards using FASTQ as its primary input format. At present, all data types can be loaded from FASTQ format files, though not all of our conversion utilities output FASTQ format. See the page on FASTQ Support in Version 7.0 for details.
- overlap: Replace the 'overlapper' with a version that can better handle short reads. A downside to this is that memory usage has increased. At the same time, replace options 'ovlMemory' and 'ovlHashBlockSize' with 'ovlHashBits' and 'ovlHashBlockLength', respectively. See the page on Configuring Overlapper in Version 7.0 for details.
- general: Add runtime support for setting the minimum fragment length and minimum overlap size.
- general: Fixes to allow reads up to 16,384bp (tested) and larger.
- general: Fixes to allow up to 2 billion reads.
- fastqToCA: Add -interleaved to fastqToCA, and support for interleaved fastq files to gatekeeper.
- sffToCA: Be more permissive when creating mate pairs. Replace existing heuristics (absolute values of errors, length) with percent identity and percent coverage thresholds for deciding when a linker alignment is good.
- gatekeeper: Add 'clearRangeHistogram', a utility for plotting a histogram of the begin/end clear range.
- gatekeeper: Add script (frg-to-fastq.pl) to convert FRGv2 (v1 not supported) to fastq.
- gatekeeper: If an Illumina read is longer than AS_READ_MAX_PACKED_LEN, store it as a 'normal' read instead.
- gatekeeper: Write a mapping from CA UID to Illumina name to (gkpStore).illuminaUIDmap.
- gatekeeper: The maximum length for the space-efficient fixed-length read storage in gatekeeper is set to 136bp. Reads shorter than this will be inefficiently stored (but more efficiently than normal). Reads longer than this are stored using the original variable-length storage mechanism. The length can be changed at compile time in AS_global.h to 104bp, 136bp or 168bp.
- overlapStore: Add -F option to set the batch size based on an explicit limit on the number of batch files, instead of guessing a memory size.
- BOG: More aggressively pop intersection bubbles.
- tigStore: Add an analysis of mate pair status in unitigs and contigs. Invoked with 'tigStore -d matepair ...'.
- tigStore: Add a script to delete a unitig from the tigStore, updating gatekeeper to reflect the loss of those fragments and mates.
- tigStore: Add options -w and -s for formatting the multialign print output.
- Consensus: Allow processing of VERY deep coverage (2000x +) unitigs/contigs.
- buildPosMap: Fix crash on assemblies with more than 32 million contigs.
- terminator: Make checkpoint optional; if no checkpoint is provided, only UTG records are output.
- sffToCA: Fix a crash when '-clear pair-of-n' encountered a read starting with a pair of N's.
- sffToCA: Fix a sign flip resulting in an incorrect 'right fragment' length being computed. When a fragment has a non-zero begin clear range, the length of the right fragment was computed too small, and might result in the loss of the fragment.
- sffToCA/OBT: When sffToCA detects linker sequences, some of the signals are too strong to be ignored, but not strong enough to be reliable. These signals are marked as potential 'contaminant' and used during Overlap Based Trimming chimera detection. This marking was not being saved properly (since Oct 23 2008) and was not being used during chimera detection.
- Gatekeeper: Related to the previous, FASTQ reads were failing to properly initialize this marking, resulting in many FASTQ reads being labeled as contaminant. This was, fortunately, a short lived problem.
- Gatekeeper: Numerous problems with reads longer than the maximum allowed (2047bp) and reads of very specific lengths were discovered and fixed. All of these resulted in gatekeeper crashing.
- Gatekeeper: When dumping FASTQ, fix a bug where both mated reads were labeled 'dir=F'. Now, one is F and the other is R, though the label is arbitrary.
- Gatekeeper: Remove -E option (error log location). It is now always written to (gkpStore).errorLog.
- Gatekeeper: Convert invalid Illumina bases (like '.') into 'N' with minimum QV, and do this before trimming off low-qv ends. The previous behavior was to discard such reads.
- Gatekeeper: Change the space-efficient fragment storage to be less monolithic. This allows loading of just the fragment metadata and not the sequence data. This breaks compatibility with earlier stores and programs.
- Gatekeeper: Change the maximum length of the space-efficient fragment storage from 104 to 136 bases. Fragments longer than this are stored in the normal storage area.
- Gatekeeper: Don't fail on invalid QV's, replace them with the closest valid value. Likewise, replace invalid bases with 'N' and a low QV.
- Gatekeeper: Bug in handling of -outtie and -type. Type was used instead of outtie, resulting in reads of type SOLEXA being reverse complemented (e.g., treated as innie) and -outtie completely ignored in all cases.
- Gatekeeper: Correct errors in dumpFastA: -allbases was not printing all bases; the mate UID was not being initialized correctly, resulting in the previous mate UID reported for fragments with no mate.
- Gatekeeper: Change the default 'frg' dump format from legacy version 1 to latest version 2. Remove '-format2' option, replace with '-legacyformat'.
- Gatekeeper: Change the -dumpinfo format to include the libIID, and first and last read IID.
- Gatekeeper: Allow reading fragments from stdin.
- OBT: Remove read UID / name references from the log files.
- overlapStore: Fix crash when there are fewer overlaps than fragments.
- overlapStore: Allow any number of input overlap files. Previously limited to about 10,000.
- overlapStore: Fix integer overflow with large memory sizes when counting the number of overlaps in a batch.
- overlapStore: Fix a crash when the memory size is too small or the number of buckets is too large and we run out of open files.
- BOG: Reduce frequency of crashing during mate based splitting on hybrid Illumina assemblies. During splitting, try to split only on non-contained fragments.
- BOG: Correct a flaw in the placement of fragments. This exhibited itself only during bubble popping, and resulted in false negatives (bubbles not popped).
- BOG: When merging unitigs, assert(0) if it fails to merge. This does occur, but the fix is too expensive given that bubble popping has already been rewritten (but not tested enough to be in CVS).
- BOG: Fix crash when a unitig is larger than 1 Mbp.
- BOG: Fix crash when bubble fails to pop. Introduced on 2010/11/09.
- unitig toggling: Resolve failure with cleanup=aggressive and doToggle=1. Cleanup was occurring before toggling, which removed the data stores, and toggling failed.
- CGW: At the start of CGW, check that all fragments and mate pairs are present in the input unitigs. This catches errors (human or otherwise) made during unitig consensus.
- CGW: Resolve rare infinite loop during scaffold merging. CGW was merging two small scaffolds, one with two contigs and the other with one contig. It then decided that the resulting scaffold was weakly connected, and split it back into the original two scaffolds, yet still counted this as a successful merge. The loop would repeat until no successful merges occurred.
- contig consensus: Fix a crash when abutting unitigs. This resolves the crashes due to "Assertion 'apos < alen' failed".
- terminator: Fix problems reading large messages. This typically showed up as a crash in the post-terminator steps on large assemblies with deep unitig coverage.
- runCA: Rename intermediate overlap file names from 'h####r####' to be named after the job index. This makes it easier to identify which files are associated with a failing job.
- runCA: Merge individual pieces into a single executable. Line numbers in the installed version are now the same as in the source code version (easier debugging and error reporting).
- runCA: Stop reporting runCA options to stderr, instead, write them to runCA-logs/ in the assembly directory.
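The Gatekeeper entries above describe replacing invalid bases with 'N' at minimum QV, replacing invalid QVs with the closest valid value, and doing both before low-QV ends are trimmed. The following Python sketch illustrates that order of operations only. It is not the gatekeeper source code; the QV bounds, the trimming threshold and the function names are assumptions chosen for illustration.

```python
VALID_BASES = set("ACGTNacgtn")
MIN_QV = 0   # assumption: lowest valid quality value
MAX_QV = 60  # assumption: illustrative upper bound, not taken from the release notes


def sanitize_read(bases, qvs, min_qv=MIN_QV, max_qv=MAX_QV):
    """Replace invalid bases with 'N' at minimum QV; clamp out-of-range QVs."""
    if len(bases) != len(qvs):
        raise ValueError("bases and qvs must have the same length")
    clean_bases, clean_qvs = [], []
    for base, qv in zip(bases, qvs):
        if base not in VALID_BASES:
            clean_bases.append("N")
            clean_qvs.append(min_qv)
        else:
            clean_bases.append(base)
            clean_qvs.append(min(max(qv, min_qv), max_qv))  # closest valid value
    return "".join(clean_bases), clean_qvs


def trim_low_quality_ends(qvs, threshold):
    """Return (begin, end) of the span left after dropping low-QV bases at both ends."""
    begin, end = 0, len(qvs)
    while begin < end and qvs[begin] < threshold:
        begin += 1
    while end > begin and qvs[end - 1] < threshold:
        end -= 1
    return begin, end
```

Sanitizing first means a read such as 'AC.GT' is kept with an 'N' in the middle instead of being discarded, and a leading or trailing run of invalid bases is then removed by the trimming step because those positions carry the minimum QV.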
- Invalid Results
- In sffToCA, paired-end read lengths are calculated incorrectly in rare cases when '-trim hard' is used. Until the issue is resolved, users should always use '-trim chop' on 454 paired-end libraries.
- Creating an overlap store for billions of fragments is a significant bottleneck. This process can take multiple days, sometimes longer than the (parallel) computation of the overlaps.
- The unitig consensus computation is slower than in previous releases.
- The bogart unitigger has not been tuned or parallelized to run conveniently on larger assemblies.
- The BOG unitigger can crash during mate based splitting.
UnitigBreakPoints* UnitigGraph::computeMateCoverage(Unitig*, int):
Assertion `loc.bgn <= bad.end+1 || loc.end <= bad.end+1' failed.
UnitigBreakPoints* MateChecker::computeMateCoverage(Unitig*, BestOverlapGraph*, int):
Assertion `loc.bgn <= bad.end+1 || loc.end <= bad.end+1' failed
SF bug 3305845, http://sourceforge.net/tracker/?func=detail&aid=3305845&group_id=106905&atid=645639
SF bug 3050723, http://sourceforge.net/tracker/?func=detail&aid=3050723&group_id=106905&atid=645639
This cannot be easily fixed. If encountered, either disable mate based splitting, increase the threshold, or switch to the bogart unitigger.
- Algorithmic limitations
- There is no explicit support for high-coverage. Coverage such as 80X induces combinations of sequencing errors that confound Celera Assembler. At best this leads to higher reported rates of allelic variation. At worst this leads to a fractured assembly. Sampling from high-coverage reads can yield a better assembly.
- There is little support for assembly of data sets with a small ratio of mate pairs to unpaired fragments. Mate pairs are used to detect misassemblies and to form contigs and scaffolds. Too few mate pairs can result in shattered assemblies and many bases in degenerate contigs.
- There is no support for bar-coded 454 data. Users with bar-coded data may use some other utility to remove the bar code sequence and partition reads by bar code into separate SFF files.
- There is no support for data from ABI SOLiD.
- There is no support for cDNA, exon-enriched DNA, or DNA amplified with bias of any sort.
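The first limitation above suggests sampling from high-coverage reads. One way to do that is sketched below in Python, under the assumption that reads are sampled uniformly at random and that mated reads are kept or dropped together so mate pairs stay intact. The function names are hypothetical, the target coverage is the user's choice, and nothing here is part of Celera Assembler.

```python
import random


def sampling_fraction(total_bases, genome_size, target_coverage):
    """Fraction of reads to keep so the expected coverage is about target_coverage."""
    if total_bases <= 0 or genome_size <= 0 or target_coverage <= 0:
        raise ValueError("all arguments must be positive")
    current_coverage = total_bases / genome_size
    return min(1.0, target_coverage / current_coverage)


def subsample_pairs(pairs, fraction, seed=0):
    """Keep each item (a mate pair or an unpaired read) with probability `fraction`.

    Each element of `pairs` is treated as one unit, so both mates of a pair
    are always kept or dropped together.
    """
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    return [pair for pair in pairs if rng.random() < fraction]
```

For example, with 160 Mbp of reads from a 2 Mbp genome (80X), sampling_fraction(160_000_000, 2_000_000, 20) gives 0.25, which is then passed to subsample_pairs() together with the list of read pairs.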
Copyright 1999-2004 by Applera Corporation. Copyright 2005-2012 by the J. Craig Venter Institute. The Celera Assembler software, also known as the wgs-assembler and CABOG, is open-source and available free of charge subject to the GNU General Public License, version 2.