Version 6.1 Release Errata

This page lists issues discovered after the release of Celera Assembler version 6.1.

May 2010

Pre-compiled binary for Darwin incompatible with Mac OSX 10.5
Mac users running this older OS may receive an error like "dyld: unknown required load command 0x80000022" when trying to run any binary from the "Darwin-i386" release package. While we investigate, the recommended solution is to download the source code and build the binaries on your own OS. This works even on pre-Intel Macs.
Compiling KMER with Intel compiler
The Celera Assembler depends on source from the KMER package (kmer.sf.net). A user was unable to compile with an Intel compiler because the KMER build script defaulted to the gcc compiler. A recent update to kmer/configure.sh fixes this problem. To get the fix, users must check out the latest KMER source.
Gatekeeper cannot find the FASTQ file
Users are reporting this failure when they supply a relative path to the fastqToCA utility. Users should supply an absolute path, since gatekeeper may not run from the same directory as fastqToCA.
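A minimal sketch of resolving paths before they are handed to fastqToCA. The function name and its `base_dir` parameter are hypothetical helpers for illustration and are not part of Celera Assembler; the sketch assumes POSIX-style paths.

```python
import os

def absolute_fastq_paths(paths, base_dir=None):
    """Return absolute versions of the given FASTQ paths.

    Relative paths are resolved against base_dir (default: the current
    working directory), so they stay valid when gatekeeper later runs
    from a different directory.
    """
    base = base_dir if base_dir is not None else os.getcwd()
    resolved = []
    for path in paths:
        if not os.path.isabs(path):
            path = os.path.join(base, path)
        # normpath collapses "." and ".." components
        resolved.append(os.path.normpath(path))
    return resolved
```

The returned paths can then be passed to fastqToCA in place of the relative ones.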
Out of memory
Two stages of Celera Assembler are RAM-hungry, because each loads an entire graph into RAM. The modules that compute unitigs (BOG and UTG) load the read+overlap graph. The module that computes scaffolds (CGW) loads the unitig+mate graph. At present there is no work-around for insufficient RAM. By way of example, one user reports that BOG swaps when run on 16 Gbp of Illumina sequence in 16 GB of RAM.
ASM to ACE converter?
The distribution packages include a utility, ca2ace.pl, for converting the CA ASM output file to an ACE database. This utility works on trivial examples, but it may not work on complex data. In particular, it does not resolve the issue of 0X regions in CA assemblies. These are regions with a "placed surrogate" consensus, where a repeat unitig could be placed even though its individual reads could not be placed. Users are advised to try third-party tools like amos or asm2ace.
Invalid gap sizes in scaffolds
CA 6.1 has a bug, first noticed in CA 5.4, that generates intrascaffold gaps with invalid characteristics. A large assembly may have a few gaps longer than any input mate pairs; presumably the gap failed to shrink after a contig was excised. Some gaps may have a large negative length; -20bp should be the limit. Finally, some gaps may have invalid standard deviations on their gap length estimate, such as NAN (not-a-number).
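The three symptoms above can be screened for after an assembly. The sketch below assumes the gaps have already been extracted into (gap_id, mean, stddev) tuples; that layout, the function name, and the `max_mate_distance` parameter are assumptions for illustration, not a CA output format.

```python
import math

def invalid_gaps(gaps, max_mate_distance, min_gap=-20):
    """Flag intrascaffold gaps that show the symptoms described above.

    gaps: iterable of (gap_id, mean_length, stddev) tuples (hypothetical layout).
    max_mate_distance: longest input mate-pair distance, in bp.
    min_gap: most negative gap length considered valid (-20 bp per the text).
    Returns a list of (gap_id, [reasons]) for the suspicious gaps only.
    """
    problems = []
    for gap_id, mean, stddev in gaps:
        reasons = []
        if not math.isfinite(mean):
            reasons.append("non-finite gap length")
        else:
            if mean < min_gap:
                reasons.append("gap more negative than limit")
            if mean > max_mate_distance:
                reasons.append("gap longer than any mate pair")
        if not math.isfinite(stddev):
            reasons.append("non-finite standard deviation")
        if reasons:
            problems.append((gap_id, reasons))
    return problems
```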
CGW long running time
We have seen long run times (weeks) by CGW, the scaffold module, on some genomes but not others. CGW is a single-core, non-parallel computation. Its run time depends on the lengths of the linked contigs that must be tested for mergers. CGW tries hard to align contigs whose shared mates indicate an overlap. If a fast aligner fails, CGW tries a Smith-Waterman alignment. Thus, run time can be quadratic in the length of linked contigs that don't align well. We may offer a patch to adjust this functionality. Note that run time is a function of genome complexity, not read count. Patient users may want to delete older checkpoint (7-*/*.ckp) files to avoid running out of disk space. Adventurous users can force the process to end, touch a "cgw.success" file in the last CGW directory, and re-launch runCA. The "terminator" process will then use the scaffolds in the last checkpoint file.
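For the checkpoint cleanup, a cautious approach is to list the older files first and delete them only after inspection. The sketch below does the listing only. The function name, the `keep` parameter, and the choice of modification time as the age criterion are assumptions; the default pattern is taken from the text above and may need adjusting to match the checkpoint names in your assembly directory.

```python
import glob
import os

def old_checkpoints(assembly_dir, keep=2, pattern=os.path.join("7-*", "*.ckp")):
    """List checkpoint files that are candidates for deletion.

    Finds files matching `pattern` under assembly_dir, sorts them from
    oldest to newest by modification time, and returns all but the
    `keep` newest. Nothing is deleted here.
    """
    files = glob.glob(os.path.join(assembly_dir, pattern))
    files.sort(key=os.path.getmtime)
    if keep <= 0:
        return files
    return files[:-keep]
```

Review the returned list before removing anything, since the newest checkpoint is what a restarted run relies on.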
Gatekeeper fails on fastq with reads longer than 104bp
Celera Assembler 6.1 cannot parse fastq with reads longer than 104bp. The error is SIGNAL 11, segmentation fault. We will try to release a fix soon.
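Until a fix is available, it can help to know in advance whether a FASTQ file contains reads over this limit. The sketch below is a hypothetical pre-check, not part of Celera Assembler; it assumes plain four-line FASTQ records (no wrapped sequence lines).

```python
def reads_longer_than(fastq_lines, limit=104):
    """Return (read_name, length) for each read longer than `limit` bases.

    fastq_lines: an iterable of lines, e.g. an open file handle.
    Assumes four lines per record: @name, sequence, +, qualities.
    """
    over = []
    lines = iter(fastq_lines)
    for header in lines:
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError("expected '@' header, got: " + header[:40])
        seq = next(lines, "").rstrip("\r\n")
        next(lines, "")  # skip the '+' separator line
        next(lines, "")  # skip the quality line
        if len(seq) > limit:
            over.append((header[1:], len(seq)))
    return over
```

A typical call would be `reads_longer_than(open(path))`; an empty result means no read exceeds the limit.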
Gatekeeper fails on reads longer than 2047 bases
Celera Assembler 6.1 can fail to parse FRG files that contain a read with more than 2047 bases. The error is SIGNAL 11, segmentation fault. We will try to release a fix soon.

June 2010

utgErrorLimit
This runCA parameter is new as of CA 6.1. The default value is 0. It should be set to 0 for pyrosequencing (454) reads and to 2.5 for short (Illumina) reads. This was not clarified in the release notes or the initial on-line documentation. A future version of the software may set it according to library type. Users of the current software should set it explicitly in their runCA spec files.
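As an illustration of setting the value explicitly, the sketch below maps a read type to the recommended value and formats a spec-file line. The function name and the read-type labels are hypothetical, and the `key=value` line form is assumed from the way options are written elsewhere on this page.

```python
# Recommended values from the text above; the dictionary keys are illustrative labels.
UTG_ERROR_LIMIT = {"454": 0, "illumina": 2.5}

def utg_error_limit_line(read_type):
    """Return a spec-file line setting utgErrorLimit for the given read type."""
    key = read_type.lower()
    if key not in UTG_ERROR_LIMIT:
        raise ValueError("no recommended utgErrorLimit for read type: " + read_type)
    return "utgErrorLimit={}".format(UTG_ERROR_LIMIT[key])
```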
asmOutputFasta
This program converts ASM to FASTA in the 9-terminator stage of assembly. It has been observed to fail, leaving an ASM but no FASTA files. An associated *.err file may say "ERROR: Expecting end of message." The problem is provoked by very large and deep multiple sequence alignments. The root cause is a buffer overflow in the Celera Assembler I/O subsystem. A work-around involves modifying the source code. Increase the value of AS_MSG_globals->msgMax in AS_MSG_pmesg.c and recompile. The memory footprint will increase.
fakeUIDs
This runCA option has no effect in CA 6.1. When fakeUIDs=0 (default), Celera Assembler is supposed to use an external source for the IDs assigned to contigs and scaffolds. Celera Assembler is supposed to get each ID from a URL. This functionality allowed institutions to instantiate their own ID server and thus guarantee enterprise-wide uniqueness of IDs. When fakeUIDs=1 (or any non-zero value), Celera Assembler uses sequential integers starting with the maximum ID already assigned (to libraries, for instance). Since this bug persisted so long, and since it was discovered in code review and not in the field, this functionality may not be repaired. The parameter may be discontinued in a forthcoming release. File a report if you need it!

July 2010

BOG unitig crash
The BOG unitig module can crash during its final stage. This stage uses mate pair constraints to split some unitigs, under the control of the bogBadMateDepth parameter. The crash has been observed on mixtures of Illumina and Sanger reads from large genomes. The error messages say "ERROR: Failed with signal ABRT (6)" and "MateChecker::computeMateCoverage(Unitig*, BestOverlapGraph*, int): Assertion `loc.bgn <= bad.end+1 || loc.end <= bad.end+1' failed." There is a partial fix in CVS as of July 15, so adventurous users should use the latest source code. The fix uses better logic to specify the split point given bad mates among small reads contained in larger reads. With the fix, assemblies typically get through the unitig stage only to have trouble in post-unitig consensus. (See SF pages on fixing these.) Thus, the underlying problem has not yet been solved.