The Celera Assembler executive script, called runCA, is the best way to run Celera Assembler. The runCA script helps you manage the 50-or-so separate programs that make up the Celera Assembler pipeline.
The runCA script divides the computation into stages. Most stages write to one specific subdirectory. The subdirectories are numbered 0 through 9 for clarity. For example, overlaps used to generate unitigs are contained in the 1-overlapper directory.
The runCA script can help recover from a premature termination. It automatically detects an unfinished job and restarts it at the appropriate stage. Say, for example, you run out of disk space during the 6-clonesize stage. After freeing some disk space elsewhere, you could just restart runCA. Since an out-of-disk error commonly leaves corrupt files, you could delete the 6-clonesize directory first. It is important not to delete or compress any of Celera Assembler's upstream directories (numbered 0 through 5). It is also important to restart runCA with an exact replica of the original command. runCA will then detect completion of the earlier stages and resume at stage 6-clonesize.
At a minimum, runCA needs to know three things: a directory to store results, a prefix to name the result files with, and input data files. All other options are optional. Options can be supplied on the command line as key=value pairs, in a system-wide config file, in a user-local config file, or in an assembly-local config file. The config files are referred to as 'spec files'; see specFiles for full details.
The usage of runCA is:
runCA -d directory -p prefix -s specfile <option=value> ... <input-files> ...
- The -d directory parameter is required. All intermediate and final results are stored within this directory. runCA will create this directory if it does not already exist. Use of absolute path names (e.g., /usr/assembly/bigfoot1) is recommended. If a relative path name is used, runCA will internally convert it to an absolute path. Note that NFS directories must be mounted in the same way on every host that runCA executes on for this to work -- if host1 mounts fileserver:/assembly as /work/assembly, but host2 mounts fileserver:/assembly as /local/work/assembly, runCA will fail when used with a relative path.
- The -p prefix parameter is required. Most files in the assembly directory will be named with this prefix. The final output will have names such as 'prefix.scf.fasta', 'prefix.asm', 'prefix.qc', etc.
- The -s specfile parameter is optional, but STRONGLY recommended. A specfile contains runCA option=value pairs and input data files, and is an alternative to supplying these directly on the command line. See specFiles for more details.
- option=value pairs set runCA options. Whitespace separates options. If any value contains spaces, use single-quotes or double-quotes around the value.
- input-files are Celera Assembler FRG files ("frag files"), though we provide conversions for a few other common formats.
The command line below would assemble data from four fragment files, two of which are compressed, writing intermediate files and outputs into files named 'bigfoot.*' in a directory called 'bigfoot1'. Since the directory is a relative path, it will be created in the current directory. Likewise, the input files must be present in the current directory.
perl $ASMBIN/runCA-OBT.pl \
  -p bigfoot \
  -d bigfoot1 \
  useGrid=1 \
  scriptOnGrid=1 \
  ovlHashBits=23 \
  ovlHashBlockLength=30000000 \
  ovlRefBlockSize=7630000 \
  frgCorrBatchSize=1000000 \
  frgCorrThreads=4 \
  fragments1.frg \
  fragments2.frg.gz \
  fragments3.frg.bz2 \
  fragments4.frg
(The backslash character ("\") is the line-continuation character, which lets us split long shell commands across multiple lines. Usually, you'll put everything on one line.)
An equivalent invocation using a spec file is:
perl $ASMBIN/runCA-OBT.pl \
  -p bigfoot \
  -d bigfoot1 \
  -s bigfoot1.spec
where 'bigfoot1.spec' contains:
# Spec file for the bigfoot1 assembly.
useGrid = 1
scriptOnGrid = 1
ovlHashBits=23
ovlHashBlockLength=30000000
ovlRefBlockSize=7630000
frgCorrBatchSize = 1000000
frgCorrThreads = 4
/local/assembly/bigfoot-fragments/fragments1.frg
/local/assembly/bigfoot-fragments/fragments2.frg.gz
/local/assembly/bigfoot-fragments/fragments3.frg.bz2
/local/assembly/bigfoot-fragments/fragments4.frg
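To make the layout of the spec file above concrete, here is a minimal Python sketch that separates option=value lines from input file paths. It is not runCA's own parser. The function name `parse_spec` and the rules it applies ('#' starts a comment, any line containing '=' is an option, every other non-blank line is an input file) are assumptions made for illustration.

```python
def parse_spec(text):
    """Split spec-file text into (options, input_files).

    Assumptions of this sketch, not confirmed runCA behavior:
    '#' starts a comment, a line containing '=' is an option,
    and any other non-blank line names an input file.
    """
    options = {}
    input_files = []
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        if "=" in line:
            key, value = line.split("=", 1)
            options[key.strip()] = value.strip()  # tolerate 'key = value'
        else:
            input_files.append(line)
    return options, input_files


example = """\
# Spec file for the bigfoot1 assembly.
useGrid = 1
ovlHashBits=23
/local/assembly/bigfoot-fragments/fragments1.frg
/local/assembly/bigfoot-fragments/fragments2.frg.gz
"""

options, input_files = parse_spec(example)
print(options)
print(input_files)
```

Values are kept as strings here; a real consumer would still need to convert and validate them.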
runCA -version will report the version of the software.
% runCA -version
CA version CVS TIP ($Id: AS_GKP_main.c,v 1.97 2011/08/30 02:59:31 brianwalenz Exp $).
CA version CVS TIP ($Id: AS_CGB_unitigger.c,v 1.45 2011/09/06 02:15:18 mkotelbajcvi Exp $).
CA version CVS TIP ($Id: BuildUnitigs.cc,v 1.87 2011/12/29 09:26:03 brianwalenz Exp $).
CA version CVS TIP ($Id: AS_CGW_main.c,v 1.86 2011/12/19 02:20:06 brianwalenz Exp $).
CA version CVS TIP ($Id: terminator.C,v 1.13 2011/12/21 00:52:19 brianwalenz Exp $).
runCA -options will report all options, their default values, and a brief description of each option.
runCA Best Practices
- Place all options and all input files into a spec file. This will serve as excellent documentation for what was assembled and how it was assembled.
- Use full paths to data files, both to document exactly what files were used, and to prevent the use of incorrect input files. Using a relative path can allow one to use the same runCA command and/or spec file for different assemblies, but loses tracking of what fragments were assembled. For example, suppose we have fragments "/work/assembly/input.frg", "/work/assembly/genome1/input.frg" and "/work/assembly/genome2/input.frg". The following runCA commands produce different assemblies:
- (if run from /work/assembly)
runCA -p genome1 -d genome1 input.frg
the assembly is computed in /work/assembly/genome1/, using /work/assembly/input.frg
- (if run from /work/assembly/genome1)
runCA -p genome1 -d . ../input.frg
the assembly is computed in /work/assembly/genome1/, using /work/assembly/input.frg (the same directory and input as the first command)
- (if run from /work/assembly/genome2)
runCA -p genome2 -d . ../input.frg
the assembly is computed in /work/assembly/genome2/, using /work/assembly/input.frg (the same input as above, but a different assembly directory)
- (if run from /work/assembly/genome1)
runCA -p genome1 -d genome1 input.frg
the assembly is computed in /work/assembly/genome1/genome1/, using /work/assembly/genome1/input.frg
- Create a new spec file for each assembly. It is common to try different parameters for a single dataset, and having a new spec file for each will reduce confusion as to which assembly used which parameters.
- Use the same name for the directory and the specfile.
- Use the same prefix for re-assemblies of the same data set. This makes it convenient to compare results across assemblies, as only the top-level directory changes. We know that scaffolds will always be in '*/9-terminator/bigfoot.scf.fasta'.
These suggestions were motivated by reassembly of the same data with different parameters. Instead of keeping a log of what commands were used, what directory they were executed in, where the input files were located, etc, we specify all options and data files in each spec file. Each assembly is then run with a standard and simple command ("runCA -p godzilla -d godzilla1 -s godzilla1.spec"). Since the files in each assembly directory are named the same (the same prefix is used for each assembly), comparing results is much easier (for example, "mergeqc.pl godzilla*/9-terminator/godzilla.qc").
In addition to the 'project-specific' spec file used above, both 'user-specific' and 'system-wide' default spec files are supported. The 'system-wide' spec file can contain carefully chosen values to configure the computational grid being used, and the 'user-specific' spec file can contain settings that a user frequently uses, for example, error rates appropriate for a metagenomic assembly.
These options affect runCA directly, or affect every component.
- showNext=boolean (default=0) (new in CA 8.0)
- If set, the next major command will be printed to the screen instead of executed. Minor tasks, such as checking that the last stage finished and preparing for the next stage, will still execute.
- pathMap=filename (default=empty-string)
- A file containing a mapping of hostname to a directory containing the CA software installation to be used on that host. In most cases, runCA can determine the correct binaries to use, and this option is not needed. This option is useful in heterogeneous environments, in particular, with multiple versions of the same OS, or different mount points. An example file:
- Be sure to use the hostname as returned by uname -n.
- The assembly directory (-d option) must be the same across all hosts.
- shell=string (default=/bin/sh)
- Which command interpreter to run scripts with. It must be sh-compatible, for example, bash. C shells, such as csh or tcsh, will not work. A full path to the binary is required.
There are five configurable error rates. Values set by environment variables (AS_OVL_ERROR_RATE, AS_CGW_ERROR_RATE and AS_CNS_ERROR_RATE) will be replaced by values set by runCA options. Note that the overlap based trimming and unitig error rates CANNOT be set via the environment.
An 'error rate' is the fraction of errors in the overlap, while an 'error limit' is an absolute number of errors. Error rates must be between 0.0 (0% error) and 0.40 (40% error; before CA 8, this was limited to 25% error). There is no upper bound on the error limit. An overlap is used if it is below either the 'error rate' or the 'error limit' threshold. For example, an overlap of length 100 bases with 2% error has 2 mismatches. Suppose utgErrorRate=0.015 and utgErrorLimit=2.5. The overlap exceeds the error rate (0.02 > 0.015), but it is still used for unitigging because the number of mismatches (2) is less than the error limit (2.5).
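The either-threshold rule can be written out as a short Python sketch. The function name `overlap_is_used` is hypothetical, and the strict "less than" comparison is an assumption based on the word "below"; the assembler's exact comparison is not stated here.

```python
def overlap_is_used(length, error_rate, max_error_rate, max_error_limit):
    """Return True if the overlap passes either threshold.

    length          -- overlap length in bases
    error_rate      -- observed fraction of errors in the overlap
    max_error_rate  -- e.g. utgErrorRate
    max_error_limit -- e.g. utgErrorLimit (absolute number of errors)
    """
    errors = length * error_rate
    return error_rate < max_error_rate or errors < max_error_limit


# The worked example from the text: 100 bases at 2% error is 2 mismatches.
print(overlap_is_used(100, 0.02, 0.015, 2.5))  # True: 2 errors < 2.5
# A longer overlap at the same rate has 4 mismatches and fails both tests.
print(overlap_is_used(200, 0.02, 0.015, 2.5))  # False
```

The error limit therefore matters mostly for short overlaps, where one or two mismatches already push the rate above the threshold.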
Error rates must obey the relationship utg ≤ ovl ≤ cns ≤ cgw. Usually, ovl = cns.
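As a quick sanity check before launching an assembly, the ordering and range constraints can be tested with a few lines of Python. This is only a sketch; `check_error_rates` is a hypothetical helper, not part of runCA.

```python
def check_error_rates(utg, ovl, cns, cgw):
    """Return a list of problems; an empty list means the rates look valid."""
    problems = []
    for name, rate in (("utg", utg), ("ovl", ovl), ("cns", cns), ("cgw", cgw)):
        if not 0.0 <= rate <= 0.40:
            problems.append(f"{name} error rate {rate} is outside 0.0..0.40")
    if not utg <= ovl <= cns <= cgw:
        problems.append("error rates must satisfy utg <= ovl <= cns <= cgw")
    return problems


# Rates listed as defaults in this document: valid.
print(check_error_rates(utg=0.015, ovl=0.06, cns=0.06, cgw=0.10))  # []
# Raising ovl above cns breaks the ordering.
print(check_error_rates(utg=0.015, ovl=0.08, cns=0.06, cgw=0.10))
```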
- ovlErrorRate=float (default=0.06)
- Error rate for overlaps, for both trimming and assembly overlaps. Overlaps above this rate will not be detected.
- cnsErrorRate=float (default=0.06)
- Error rate for consensus. Consensus will expect to find alignments below this level, but it doesn't strictly enforce it.
- cgwErrorRate=float (default=0.10)
- Error rate for scaffolder. Scaffolder will try to merge unitigs and contigs up to this error rate.
- obtErrorRate=float (default=see below) (new in CA 8.0)
- obtErrorLimit=float (default=see below) (new in CA 8.0)
- These control the quality of overlaps used during Overlap Based Trimming. By default, the error rate used for unitigging is used (utgErrorRate and utgErrorLimit for the utg and bog unitiggers; utgGraphErrorRate and utgGraphErrorLimit for the bogart unitigger). This only affects the trimming portion (finalTrim). Chimera and spur detection filter based on the ovlErrorRate.
Unitigger error rates are more complicated. Overlaps above this error rate are not used during unitig construction. Each unitig algorithm uses a different set of error rates:
- utg uses utgErrorRate.
- bog uses utgErrorRate and utgErrorLimit.
- bogart uses utgGraphErrorRate, utgGraphErrorLimit, utgMergeErrorRate and utgMergeErrorLimit.
- utgErrorRate=float (default=0.015 for utg and 0.030 for bog)
- Overlaps below this threshold are used in the utg and bog unitiggers.
- utgErrorLimit=float (default=2.5) (new in CA 6.1)
- Overlaps below this threshold are used in the utg and bog unitiggers.
- utgGraphErrorRate=float (default=0.030) (new in CA 7.0)
- Overlaps below this threshold are used for Best Overlap Graph construction in the bogart unitigger.
- utgGraphErrorLimit=float (default=3.25) (new in CA 7.0)
- Overlaps below this threshold are used for Best Overlap Graph construction in the bogart unitigger.
- utgMergeErrorRate=float (default=0.045) (new in CA 7.0)
- Overlaps below this threshold are used for bubble popping and repeat detection in the bogart unitigger.
- utgMergeErrorLimit=float (default=5.25) (new in CA 7.0)
- Overlaps below this threshold are used for bubble popping and repeat detection in the bogart unitigger.
Minimum Fragment Length and Minimum Overlap Length
Starting in CA 7.0, the minimum fragment length and minimum overlap length can be set at run time. Fragments below the minimum length are discarded during gatekeeper, and overlaps below the minimum overlap length are not computed.
- frgMinLen=integer (default=64)
- Fragments shorter than this length are not loaded into the assembler.
- ovlMinLen=integer (default=40)
- Overlaps shorter than this length are not computed.
Stopping runCA Early
runCA can stop after certain stages are finished. There is no corresponding startBefore option because runCA requires a very specific directory layout that is both difficult to describe and difficult to recreate manually. It is, however, possible to get much the same effect using the do* options.
- stopBefore=string (default=empty-string)
- meryl Stop before computing mer histograms.
- initialTrim Stop before the OBT initial quality trim.
- deDuplication Stop before the OBT de-duplication.
- finalTrimming Stop before the OBT trim point merge.
- chimeraDetection Stop before the OBT chimera detection.
- classifyMates Stop before de-novo classification.
- unitigger Stop before unitigger.
- scaffolder Stop before the scaffolding stage starts.
- CGW Stop before the CGW program starts.
- eCR Stop before the extend clear ranges program starts. extendClearRanges is an alias for this.
- eCRPartition Stop before partitioning for extend clear ranges. extendClearRangesPartition is an alias for this.
- terminator Stop before terminator.
- stopAfter=string (default=empty-string) (new in CA 6.1)
- initialStoreBuilding Stop after the fragment and gatekeeper stores are created.
- meryl Stop after mer counts are generated.
- overlapBasedTrimming Stop after the Overlap Based Trimming algorithm has updated the clear ranges. OBT is an alias for this.
- overlapper Stop after the overlapper finishes, and the overlap store is created.
- classifyMates Stop after de-novo classification.
- unitigger Stop after unitigs are constructed, but before consensus starts.
- utgcns Stop after unitig consensus finishes; consensusAfterUnitigger is an alias for this.
- scaffolder Stop after all stages of scaffolding are finished.
- ctgcns Stop after contig consensus finishes; consensusAfterScaffolder is an alias for this.
Grid Engine Options
runCA can make use of a computational grid, using Sun Grid Engine (SGE) or IBM Platform LSF. Please see runCA and Sun Grid Engine for an in-depth discussion of using runCA on SGE. Here, we only list the options that affect SGE.
- gridEngine=string (default=SGE) (new in CA 8.0)
- Select SGE or LSF as the grid engine.
SGE is off by default in runCA. This will run all stages of the pipeline on the same machine where runCA was started. Two binary switches on runCA enable four types of grid behavior:
- useGrid=0 scriptOnGrid=0 - default; run everything on the local machine.
- useGrid=1 scriptOnGrid=0 - runCA will stop and display job submission command when any parallel stage needs to be run. The user needs to manually run this stage, or submit jobs to their computational grid. When the stage is finished, runCA must be restarted for the assembly pipeline to continue.
- useGrid=1 scriptOnGrid=1 - submit runCA directly to SGE; the stages of the pipeline will automatically run in parallel on the computational grid.
- useGrid=0 scriptOnGrid=1 - submit runCA directly to SGE; all stages of the pipeline will run on a single grid host.
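The four combinations above can be summarized as a small lookup table. The Python sketch below simply restates the list; the name `describe_grid_mode` is hypothetical, and treating any non-zero value as "on" for both switches is an assumption of this sketch.

```python
GRID_MODES = {
    (0, 0): "run everything on the local machine",
    (1, 0): "stop and display the job submission command at each parallel stage",
    (1, 1): "submit runCA to the grid; parallel stages run on the grid",
    (0, 1): "submit runCA to the grid; all stages run on a single grid host",
}


def describe_grid_mode(use_grid, script_on_grid):
    """Map the two switches to the behavior described in the list above."""
    key = (1 if use_grid else 0, 1 if script_on_grid else 0)
    return GRID_MODES[key]


for use_grid in (0, 1):
    for script_on_grid in (0, 1):
        print(f"useGrid={use_grid} scriptOnGrid={script_on_grid}: "
              f"{describe_grid_mode(use_grid, script_on_grid)}")
```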
- useGrid=integer (default=0)
- If zero, no stage will use the grid. If non-zero, the grid will be used for stages that support it, and that are enabled. Each stage may independently decide to not use the grid.
- scriptOnGrid=integer (default=0)
- If zero, run only the parallel components on the grid. If one, submit the controlling script (aka runCA) to the grid.
- mbtOnGrid=integer (default=1) (new in CA 7.0)
- If zero, do not use the grid for mer-based trimming. This option has no effect when useGrid is off.
- ovlOnGrid=integer (default=1)
- If zero, do not use the grid for overlapping. This option has no effect when useGrid is off.
- frgCorrOnGrid=integer (default=0)
- If zero, do not use the grid. Use of this option is discouraged, unless your grid has fast access to the assembly directory.
- ovlCorrOnGrid=integer (default=0)
- If zero, do not use the grid. Use of this option is discouraged, unless your grid has fast access to the assembly directory.
- cnsOnGrid=integer (default=1)
- If zero, do not use the grid for consensus. This option has no effect when useGrid is off.
Each stage can specify a different SGE configuration, for example, requesting multiple slots for the thread-aware overlap stage, and more memory for the memory-intensive scaffolding step.
- sge=string (default=empty-string)
- string is passed to the qsub command used to submit ANY job to the grid.
- sgeName=string (default=empty-string) (new in CA 6.1)
- string is appended to the job name supplied to SGE. This allows multiple assemblies with the same assembly name (-p option) to run concurrently without blocking. Without it, for example, multiple assemblies named 'asm' will each wait for all overlapper jobs named 'ovl_asm' to finish.
- sgeScript=string (default=empty-string)
- string is passed to the qsub command used to submit runCA to the grid. Every stage that runs, unless explicitly submitted to the grid, is run from within runCA, in particular, unitigger and scaffolder are run here.
- sgeMerTrim=string (default=empty-string)
- string is passed to the qsub command used to submit OBT merTrim jobs to the grid.
- sgeOverlap=string (default=empty-string)
- string is passed to the qsub command used to submit overlap jobs to the grid.
- sgeMerOverlapSeed=string (default=empty-string)
- string is passed to the qsub command used to submit mer overlap seed finding (overmerry) jobs to the grid.
- sgeMerOverlapExtend=string (default=empty-string)
- string is passed to the qsub command used to submit mer overlap seed extension (olap-from-seeds) jobs to the grid.
- sgeConsensus=string (default=empty-string)
- string is passed to the qsub command used to submit consensus jobs to the grid.
- sgeFragmentCorrection=string (default=empty-string)
- string is passed to the qsub command used to submit fragment correction jobs to the grid.
- sgeOverlapCorrection=string (default=empty-string)
- string is passed to the qsub command used to submit overlap correction jobs to the grid.
These options are specific to stages of the assembler.
- gkpFixInsertSizes=integer (default=1)
- If non-zero, gatekeeper will fix insert size estimates whose standard deviation is too large or too small. An acceptable estimate satisfies 0.1 * mean < std.dev. < 1/3 * mean. If the standard deviation is outside this range, it is reset to 0.1 * mean. See also computeInsertSize.
- gkpAllowInefficientStorage=integer (default=1)
- If set, allow storage of long-reads before short-reads. This is very inefficient in time and space, both disk and memory.
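The gkpFixInsertSizes rule described above (keep the standard deviation when it lies strictly between 0.1 * mean and 1/3 * mean, otherwise reset it to 0.1 * mean) can be sketched in Python. The function name `fix_insert_size` is hypothetical; this illustrates the stated rule and is not gatekeeper's code.

```python
def fix_insert_size(mean, std_dev):
    """Return the standard deviation after the gkpFixInsertSizes-style check."""
    low = mean / 10.0   # 0.1 * mean
    high = mean / 3.0   # 1/3 * mean
    if low < std_dev < high:
        return std_dev  # acceptable estimate, keep as supplied
    return low          # too small or too large: reset to 0.1 * mean


print(fix_insert_size(3000, 500))   # within range, kept
print(fix_insert_size(3000, 1500))  # too large, reset
print(fix_insert_size(3000, 100))   # too small, reset
```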
The Celera Assembler incorporates many different methods to vector and quality trim input fragments, and the choice of algorithm depends on the type of data.
|454||From SFF file||YES||largest||YES (linker)|
Overlap Based Trimming invokes the overlap stage, see the Classic Overlapper Options below to configure the overlapper. It is not possible to configure the overlapper differently for overlap based trimming and normal overlaps.
Overlap based trimming writes several log files:
- asm.initialTrimLog -- one line per read. Immutable reads are not modified and do not appear in the log. Each line is a whitespace-separated list: the uid,iid pair; the original clear begin and end; the quality trim begin and end; the vector clear begin and end; and the final clear begin and end.
- asm.mergeLog -- one line per read. Whitespace separated list of IID, final left and right trimming. Trimming due to chimera and spur detection are not included here. All reads are reported.
- asm.chimera.report -- many lines per read. It shows the type of problem fixed, the resulting clear range, and any evidence for the change.
- vectorIntersect=filename (default=empty-string)
- The path to a file containing a list of the vector clear range for each read. The format is uid vector-left vector-right, one UID per line. Coordinates are base-based. If using "format 2" fragments, this option is not necessary.
- doOverlapBasedTrimming=integer (default=1) (alias
- If non-zero, do trimming.
- doDeDuplication=integer (default=1) (new in CA 6.1)
- If non-zero, search for duplicate reads or mate-pairs in 454 fragments. Disabled if doOBT=0.
- doChimeraDetection=off or normal or aggressive (default=normal)
- Detect chimeric reads by comparison to other reads. Disabled if doOBT=0.
- mbtBatchSize=integer (default=1000000) (new in CA 7.0)
- Process this many fragments per mer-based trimming batch.
- mbtThreads=integer (default=4) (new in CA 7.0)
- Use this many threads per mer-based trimming process.
- mbtConcurrency=integer (default=1) (new in CA 7.0)
- Run this many mer-based trimming processes at the same time on the local machine.
- mbtIlluminaAdapter=integer (default=1) (new in CA 8.0)
- Remove Illumina adapter sequence during merTrim
- mbt454Adapter=integer (default=1) (new in CA 8.0)
- Remove 454 adapter sequence during merTrim
Overlapper performs an all-fragments against all-fragments alignment. Each pair of fragments is aligned to decide if they overlap. In effect, it is populating an array with alignment information. Overlapper is able to segment the computation on both axes of the array. The fragments along one axis are used to construct a hash-table to seed the alignments. The fragments along the other axis then query the hash-table one at a time.
For small assemblies, one can simply divide the number of fragments by the amount of parallelization wanted along each axis and use that as the block size. Because the computation is segmented on both axes, dividing the number of fragments by 4 gives 4 x 4 = 16 jobs.
For large assemblies, we suggest using a large ovlRefBlockSize, and using ovlHashBlockLength to control the number of jobs.
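The small-assembly arithmetic can be sketched in Python. It assumes both axes of the overlap array are cut into the same number of equal blocks, so the job count is the square of the blocks per axis. The function name `fragments_per_block` is hypothetical, and which runCA option receives the result is not specified by this sketch.

```python
import math


def fragments_per_block(num_fragments, jobs_wanted):
    """Return (block_size, actual_jobs) for a square blocking of the array."""
    blocks_per_axis = math.ceil(math.sqrt(jobs_wanted))
    # Integer ceiling division, so the last block absorbs any remainder.
    block_size = -(-num_fragments // blocks_per_axis)
    return block_size, blocks_per_axis * blocks_per_axis


# One million fragments, 16 jobs wanted: divide by 4 on each axis.
print(fragments_per_block(1_000_000, 16))  # (250000, 16)
# A request that is not a perfect square is rounded up to the next square.
print(fragments_per_block(1_000_000, 10))  # (250000, 16)
```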
Three options exist to select the style of overlapper to use:
- overlapper=ovl or mer (default=ovl)
- Select which overlap stage to use.
- obtOverlapper=ovl or mer (default=ovl)
- Select an overlap stage just for the OBT (overlap-based trimming) pre-process.
- ovlOverlapper=ovl or mer (default=ovl)
- Select an overlap stage for the main (unitig construction) portion of assembly.
These options control how much memory can be used to build the overlap store, and whether intermediate files are retained after building the overlap store.
- ovlStoreMemory=integer (default=1024)
- The amount of memory, in megabytes, to use for building overlap stores. The stage called overlapStore runs after the last overlap job finishes. It collects the outputs of all the overlap jobs. It generates the ovlStore, a directory of binary files used as a database by the rest of the pipeline. The parameter affects the running time of a bucket sort, which is always performed as a single process. Use the largest value possible. To gauge the effect of this parameter, look for a file called <prefix>.ovlStore.err with a line like, "For 6006 million overlaps, in 4096MB memory, I'll put 787200 IID's (approximately 268435456 overlaps) per bucket." The same file shows the number of buckets processed so far.
- saveOverlaps=integer (default = 0)
- Use zero to have intermediate files erased after the overlap store is created. The intermediate files are quite large, even though they are zipped, and they are completely redundant with the overlap store. Use any non-zero value to retain the intermediate files. They might be useful for analyzing parameter settings, for instance.
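The numbers in the ovlStoreMemory log line quoted above can be related with a little arithmetic. The Python sketch below assumes that the number of buckets is the total overlap count divided by the overlaps per bucket, rounded up, and that "MB" means 2**20 bytes. Both are assumptions, and the function names are hypothetical.

```python
def bucket_count(total_overlaps, overlaps_per_bucket):
    """Buckets needed if each holds at most overlaps_per_bucket overlaps."""
    return -(-total_overlaps // overlaps_per_bucket)  # ceiling division


def bytes_per_overlap(memory_mb, overlaps_per_bucket):
    """Memory per overlap implied by the log line (MB taken as 2**20 bytes)."""
    return memory_mb * 1024 * 1024 / overlaps_per_bucket


# Values from the example log line: 6006 million overlaps, 4096 MB,
# approximately 268435456 overlaps per bucket.
print(bucket_count(6_006_000_000, 268_435_456))  # 23
print(bytes_per_overlap(4096, 268_435_456))      # 16.0
```

Under these assumptions, doubling ovlStoreMemory would roughly halve the number of buckets.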
Both the ovl and mer overlappers use seed-and-extend alignments. These parameters control what seed size to use, and how frequent a mer must be before it is labeled a repeat and ignored:
- merSize=integer (default=22)
- Sets K, the length of each K-mer. This sets the length of the seeds used by the seed & extend algorithm. This parameter is equivalent to word size in BLAST. This parameter affects the ovl overlapper, the mer overlapper and the meryl seed finder. Setting this one parameter is equivalent to setting two others: obtMerSize and ovlMerSize. The result of setting all three is undefined.
- obtMerSize=integer (default=22)
- Sets K, the length of each K-mer. This parameter affects the OBT pre-process only, not the main assembly. This parameter affects the ovl overlapper, the mer overlapper and the meryl seed finder.
- ovlMerSize=integer (default=22)
- Sets K, the length of each K-mer. This parameter affects unitig construction and thus the assembly; it has no effect on the OBT pre-process. This parameter affects the ovl overlapper, the mer overlapper and the meryl seed finder.
- obtMerThreshold=integer (default=auto)
- Mers with count larger than this value will not be used to seed overlaps for Overlap Based Trimming. Only for ovl overlapper. The special value 0 disables mer counting for OBT, using all mers for seeds. The 'auto' value will examine a histogram of mer counts to pick a (usually) suitable value.
- ovlMerThreshold=integer (default=auto)
- Mers with count larger than this value will not be used to seed normal overlaps. Only for ovl overlapper. The special value 0 disables mer counting for OVL, using all mers for seeds. The 'auto' value will examine a histogram of mer counts to pick a (usually) suitable value.
- merThreshold=integer (default=auto)
- Assigns one value to both obtMerThreshold and ovlMerThreshold.
- obtFrequentMers=string (default=empty-string) (new in CA 8.0)
- A path to a FASTA file of K-mers to ignore when seeding obt overlapper overlaps. Each sequence in the file must be exactly obtMerSize bases long. If supplied, meryl will not run.
- ovlFrequentMers=string (default=empty-string) (new in CA 8.0)
- A path to a FASTA file of K-mers to ignore when seeding ovl overlapper overlaps. Each sequence in the file must be exactly ovlMerSize bases long. If supplied, meryl will not run, unless mer counts are needed by, e.g., OBT merTrim.
The ovl overlapper can be restricted to operating on specific libraries. These options are esoteric and not generally useful.
- ovlHashLibrary=integer (default=0) (new in CA 8.0)
- For ovl overlaps, only load hash fragments from specified lib, 0 means all
- ovlRefLibrary=integer (default=0) (new in CA 8.0)
- For ovl overlaps, only load ref fragments from specified lib, 0 means all
- obtHashLibrary=integer (default=0) (new in CA 8.0)
- For obt overlaps, only load hash fragments from specified lib, 0 means all
- obtRefLibrary=integer (default=0) (new in CA 8.0)
- For obt overlaps, only load ref fragments from specified lib, 0 means all
- obtCheckLibrary=integer (default=1) (new in CA 8.0)
- Check that all libraries are used during obt overlaps
- ovlCheckLibrary=integer (default=1) (new in CA 8.0)
- Check that all libraries are used during ovl overlaps
The ovl overlapper is the classic overlapper for Celera Assembler. It is appropriate for most situations. It uses a classic seed-and-extend algorithm. Whereas BLAST is tuned to find homology between long sequences, this algorithm is tuned to find overlaps between Sanger-read-length sequences and specifically the overlaps that promote assembly. Unlike the mer overlapper, this stage makes no special accommodation for homopolymer run length uncertainty; nevertheless, the ovl overlapper works well on Sanger and 454 sequence.
Every overlap job launches a separate process that feeds output through gzip file compression.
- ovlThreads=integer (default=2)
- The number of compute threads to use per overlap job.
- ovlConcurrency=integer (default=1)
- When not using SGE, the number of concurrent overlap jobs to run at the same time.
- ovlHashLoad=float (default=0.75) (new in CA 7.0)
- Maximum hash table load. If set too high, table lookups are inefficient; if too low, search overhead dominates run time. The value of 0.75 has not been shown to be the value that minimizes run time.
- ovlHashBits=integer (default=22) (new in CA 7.0)
- The size of the hash table, in bits. This is a fixed size allocation, which does not change based on ovlHashBlockLength or ovlRefBlockSize.
Bits  Table Size       Memory
18    5,505,024        54 MB
19    11,010,048       108 MB
20    22,020,096       216 MB
21    44,040,192       432 MB
22    88,080,384       864 MB
23    176,160,768      1728 MB
24    352,321,536      3456 MB
25    704,643,072      6912 MB
26    1,409,286,144    13824 MB
27    2,818,572,288    27648 MB
28    5,637,144,576    55296 MB
29    11,274,289,152   110592 MB
30    22,548,578,304   221184 MB
- The table size is the maximum number of different k-mer sequences that the table can load. Overlapper will stop loading the table when the number of k-mer sequences reaches ovlHashLoad (75%) of this maximum value.
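The Bits / Table Size listing above follows a regular pattern: every listed size equals 21 * 2**bits. Multiplying by ovlHashLoad then gives the point at which loading stops. The Python sketch below reproduces that arithmetic; the pattern is read off the listed values rather than taken from the overlapper source, and the function names are hypothetical.

```python
def hash_table_size(bits):
    """Table size for a given ovlHashBits, following the listed values."""
    return 21 * 2 ** bits


def max_loaded_entries(bits, hash_load=0.75):
    """Number of entries at which loading stops (ovlHashLoad * table size)."""
    return int(hash_table_size(bits) * hash_load)


print(hash_table_size(22))     # 88080384
print(max_loaded_entries(22))  # 66060288
print(max_loaded_entries(18))  # 4128768
```

The last two values match the "out of ... max" entry counts that the overlapper reports in its "HASH LOADING STOPPED: entries" log lines for ovlHashBits=22 and ovlHashBits=18.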
- ovlHashBlockLength=integer (default=100000000) (new in CA 7.0)
- Amount of sequence, in bases, to load into the hash table. Each base loaded consumes 10 bytes of memory; loading the default 100,000,000 bases will consume 1 GB of memory, in addition to that used by ovlHashBits.
- ovlRefBlockSize=integer (default=2000000)
- This directly controls the number of overlap jobs and the run time of each. Smaller values result in more jobs that each need less time to finish. If this value is too small, overhead will dominate the total time; if too large, concurrency can be degraded.
The choice for these parameters is based mostly on the hardware you are computing overlaps on.
First, how many threads? The start of the overlap job builds the hash table. This is single-threaded. If we request the maximum number of threads (and so we can run one job at a time) all but one CPU is idle while the table is being constructed. On the other hand, if we request one thread (and so we can run N jobs concurrently), each job will be allocating memory, and we might run out of memory. To make it concrete, suppose we have a machine with 8GB and 4 CPUs. By requesting 4 threads, we can run one large job but leave 3 CPUs idle. If we request 2 threads, we can run 2 jobs concurrently, but each job is limited to 4GB. Thus, how many threads to use is related to how much memory to use.
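The trade-off between threads and memory in the paragraph above can be tabulated with a short Python sketch. It assumes that memory is shared evenly among concurrent jobs and that each job occupies exactly one CPU while it builds its hash table. The function name `concurrency_plan` is hypothetical.

```python
def concurrency_plan(total_memory_gb, cpus, threads_per_job):
    """Return (concurrent_jobs, memory_per_job_gb, idle_cpus_during_hash_build)."""
    jobs = cpus // threads_per_job
    if jobs < 1:
        raise ValueError("threads_per_job exceeds the number of CPUs")
    memory_per_job = total_memory_gb / jobs
    idle_during_build = cpus - jobs  # hash construction is single-threaded
    return jobs, memory_per_job, idle_during_build


# The 8 GB, 4 CPU machine from the text.
print(concurrency_plan(8, 4, 4))  # (1, 8.0, 3)
print(concurrency_plan(8, 4, 2))  # (2, 4.0, 2)
print(concurrency_plan(8, 4, 1))  # (4, 2.0, 0)
```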
So, how much memory? The ovlHashBits and ovlHashBlockLength parameters tell us (approximately) how much memory an overlap job will need.
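A rough estimate can be computed from the two figures given in the option descriptions: the hash table memory listed for each ovlHashBits value (54 MB at 18 bits, doubling with each extra bit) and 10 bytes per base of ovlHashBlockLength. The Python sketch below adds the two, using decimal megabytes for the sequence part. It is an approximation that ignores any other overhead, and the function names are hypothetical.

```python
def hash_table_memory_mb(bits):
    """Hash table memory in MB, following the listed values (54 MB at 18 bits)."""
    return 54 * 2 ** (bits - 18)


def overlap_job_memory_mb(bits, hash_block_length):
    """Approximate memory for one overlap job, in MB."""
    sequence_mb = hash_block_length * 10 / 1_000_000  # 10 bytes per base
    return hash_table_memory_mb(bits) + sequence_mb


# ovlHashBits=22 with the default ovlHashBlockLength=100000000.
print(overlap_job_memory_mb(22, 100_000_000))  # 1864.0
print(hash_table_memory_mb(25))                # 6912
```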
For a machine with 8GB of memory:
- ovlHashBits=25 will immediately use nearly 7 GB of that memory. Assuming 1/2 GB for operating system overhead, this leaves only 500 MB for loading sequence data, implying ovlHashBlockLength of at most 50,000,000 can be used. A hash table of this size supports loading up to 704,643,072 k-mers, but we will load, at most, 50,000,000 (one k-mer per base of sequence). These settings make no sense.
- ovlHashBits=24 will immediately use 3.5 GB, leaving 3 GB for sequence data. This implies an ovlHashBlockLength of at most 300,000,000. The hash table can load up to 352,321,536 k-mers. A perfect match!
Or is it? If we follow the simple strategy above, we will be vastly under-loading the hash table. This strategy assumed that all of the input sequence is unique. In real data, we have many copies of the genome and the genome itself has repeats. All those extra copies will use the same hash table location. Unfortunately, adjusting for coverage and repeats is difficult to do precisely. It depends not only on the coverage and repeat content, but also on the size of the genome and the amount of sequencing error in the reads.
The overlap job log file (0-overlaptrim-overlap/#######.out and 1-overlapper/######.out) can assist in picking a correct value. It will contain sections such as:
HASH LOADING STOPPED: strings 38020 out of 38020 max.
HASH LOADING STOPPED: length 15487424 out of 15487424 max.
HASH LOADING STOPPED: entries 4435417 out of 66060288 max (load 5.04).
In this example, we have loaded 15,487,424 bases of sequence, yet used only 4,435,417 out of 66,060,288 hash table entries (ovlHashLoad=0.75 ovlHashBits=22). This would indicate that we can greatly increase ovlHashBlockLength (to load more sequence) or decrease ovlHashBits (to use less memory).
Another example, from a different data set:
HASH LOADING STOPPED: strings 1090874 out of 1090874 max.
HASH LOADING STOPPED: length 110000038 out of 110000038 max.
HASH LOADING STOPPED: entries 61668288 out of 264241152 max (load 17.50).
The first example loaded 3.5 bases per entry (15487424 / 4435417), but the second loaded only 1.8 bases per entry (110000038 / 61668288).
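The bases-per-entry figures can be pulled out of a log automatically. The following is a minimal sketch that assumes the log contains "HASH LOADING STOPPED" reports in the form shown above, with the "entries" line last in each report; the function and variable names are illustrative and not part of Celera Assembler.

```python
import re

# Matches reports such as:
#   HASH LOADING STOPPED: length 15487424 out of 15487424 max.
STOP_RE = re.compile(
    r"HASH LOADING STOPPED:\s+(strings|length|entries)\s+(\d+)\s+out of\s+(\d+)\s+max"
)


def summarize_batches(log_text):
    """Return one summary dict per hash-table batch found in log_text."""
    batches = []
    current = {}
    for kind, used, limit in STOP_RE.findall(log_text):
        current[kind] = (int(used), int(limit))
        if kind == "entries":  # assumed to be the last line of each report
            bases = current.get("length", (0, 0))[0]
            entries, entries_max = current["entries"]
            batches.append({
                "bases": bases,
                "entries": entries,
                "entries_max": entries_max,
                "bases_per_entry": bases / entries if entries else 0.0,
                "fill": entries / entries_max,
            })
            current = {}
    return batches


example_1 = (
    "HASH LOADING STOPPED: strings 38020 out of 38020 max. "
    "HASH LOADING STOPPED: length 15487424 out of 15487424 max. "
    "HASH LOADING STOPPED: entries 4435417 out of 66060288 max (load 5.04)."
)
example_2 = (
    "HASH LOADING STOPPED: strings 1090874 out of 1090874 max. "
    "HASH LOADING STOPPED: length 110000038 out of 110000038 max. "
    "HASH LOADING STOPPED: entries 61668288 out of 264241152 max (load 17.50)."
)

for text in (example_1, example_2):
    for batch in summarize_batches(text):
        print(f"{batch['bases_per_entry']:.1f} bases per entry, "
              f"table {batch['fill']:.1%} full")
```

Because the pattern is matched anywhere in the text, the sketch works whether the reports are on one line or on separate lines.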
As a rule of thumb, setting ovlHashBlockLength to twice the number of entries available in the table seems reasonable. Keep in mind that increasing this parameter increases memory usage, and in the 8GB example above, we cannot set ovlHashBlockLength to more than 300,000,000.
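The table capacities quoted in this section can be reproduced with a short calculation. The constant of 21 entries per bucket below is inferred from the numbers in this section (for example, 66,060,288 entries at ovlHashBits=22 and ovlHashLoad=0.75); it is an assumption, not a value taken from the overlapper source. The capacities quoted for the 8GB example correspond to a load factor of 1.0. The helper names are hypothetical.

```python
ENTRIES_PER_BUCKET = 21  # assumption: inferred from the figures in this section


def hash_table_capacity(ovl_hash_bits, ovl_hash_load=1.0):
    """Approximate number of k-mer entries the hash table can hold."""
    return int((1 << ovl_hash_bits) * ENTRIES_PER_BUCKET * ovl_hash_load)


def suggested_block_length(ovl_hash_bits, ovl_hash_load, max_block_length):
    """Rule of thumb from the text: twice the table capacity,
    capped by the block length that available memory allows."""
    return min(2 * hash_table_capacity(ovl_hash_bits, ovl_hash_load),
               max_block_length)


print(hash_table_capacity(25))        # 704643072
print(hash_table_capacity(24))        # 352321536
print(hash_table_capacity(22, 0.75))  # 66060288
print(suggested_block_length(24, 0.75, 300_000_000))  # 300000000
```

In the last call the rule of thumb would ask for more than 500 million bases, so the memory cap of 300,000,000 from the 8GB example wins.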
Over-loading the table will not cause overlapper to fail. Overlapper itself will run multiple batches to process all the input sequence in the available hash table space.
Using the first example above as a starting point, let's decrease the hash table size greatly, from ovlHashBits=22 to ovlHashBits=18. The log shows two batches were run:
HASH LOADING STOPPED: strings 26706 out of 125000 max.
HASH LOADING STOPPED: length 11393348 out of 15580000 max.
HASH LOADING STOPPED: entries 4128858 out of 4128768 max (load 75.00).
### realloc Extra_Ref_Space max_extra_ref_ct = 9526028
String_Ct = 26706 Extra_String_Ct = 0 Extra_String_Subcount = 93
Read 1307 kmers to mark to skip
!!! Hash table did not read all frags
Read 26706 instead of 38020
Build_Hash_Index from 26707 to 38020
HASH LOADING STOPPED: strings 11314 out of 125000 max.
HASH LOADING STOPPED: length 4094076 out of 15580000 max.
HASH LOADING STOPPED: entries 2476450 out of 4128768 max (load 44.99).
This is showing that the first batch filled the table to capacity, and complained (not very attractively) that the hash table did not read all fragments in the input batch. This hash table is processed (against all ovlRefBlockLengthFrags) and discarded. A new hash table is created using the unprocessed input sequence. This hash table was loaded about half full, and all ovlRefBlockLengthFrags are (again) processed.
The only danger with this is that the last batch might be small, and the overlap job must process all ovlRefBlockSize fragments again, against little sequence data.
HASH LOADING STOPPED: strings 37997 out of 125000 max.
HASH LOADING STOPPED: length 15480130 out of 15480000 max.
HASH LOADING STOPPED: entries 4434956 out of 8257536 max (load 40.28).
### realloc Extra_Ref_Space max_extra_ref_ct = 13726511
String_Ct = 37997 Extra_String_Ct = 0 Extra_String_Subcount = 93
Read 1307 kmers to mark to skip
!!! Hash table did not read all frags
Read 37997 instead of 38020
Build_Hash_Index from 37998 to 38020
HASH LOADING STOPPED: strings 23 out of 125000 max.
HASH LOADING STOPPED: length 7294 out of 15480000 max.
HASH LOADING STOPPED: entries 6446 out of 8257536 max (load 0.06).
(This example is contrived, since I decreased ovlHashBlockLength until I got a small second batch).
CAUTION: Do not set parameters too close to your available memory. Overlapper allocates a bit more memory than we are accounting for here. Be safe and leave one or two GB of free memory per job. If needed, decrease ovlHashBits by one. Performance will not be impacted significantly.
Like the classic overlapper, the mer overlapper also uses a seed-and-extend methodology. However, all seeds are found first, allowing a second pass to examine all overlaps for a given fragment. The second pass computes the first half of Fragment Error Correction.
The mer overlapper also uses Classic Overlapper options obtMerSize and ovlMerSize.
- merCompression=integer (default=1)
- If the mer overlapper is used, compress homopolymer runs to this many letters. This value applies to the meryl mer counts too. For example, ACTTTAAC with merCompression=1 would be ACTAC.
- merOverlapperThreads=integer (default=2)
- The number of compute threads to use. Usually the number of CPUs your host has.
- merOverlapperSeedBatchSize=integer (default=100000)
- The number of fragments used per batch of seed finding. The amount of memory used is directly proportional to the number of fragments. (The exact relationship is not yet documented.)
- merOverlapperExtendBatchSize=integer (default=75000)
- The number of fragments used per batch of seed extension. The amount of memory used is directly proportional to the number of fragments. See option frgCorrBatchSize for hints, but use those numbers with caution.
- merOverlapperSeedConcurrency=integer (default=1)
- If not on the grid, run this many seed finding processes on the local machine at the same time.
- merOverlapperExtendConcurrency=integer (default=1)
- If not on the grid, run this many seed extension processes on the local machine at the same time.
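The homopolymer compression described under merCompression can be illustrated with a short sketch. This is not the mer overlapper's code; the function name is hypothetical, and values below 1 are rejected here only because the text does not say how they are treated.

```python
from itertools import groupby


def compress_homopolymers(sequence, mer_compression=1):
    """Shorten every run of identical letters to at most mer_compression letters."""
    if mer_compression < 1:
        raise ValueError("this sketch only handles mer_compression >= 1")
    return "".join(
        base * min(len(list(run)), mer_compression)
        for base, run in groupby(sequence)
    )


print(compress_homopolymers("ACTTTAAC", 1))  # ACTAC
print(compress_homopolymers("ACTTTAAC", 2))  # ACTTAAC
```

The first call reproduces the example from the option description.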
When properly installed, Celera Assembler includes the Mighty Meryl software from the KMER package (http://kmer.sf.net). The Celera Assembler source code includes, in AS_MER, a scaled down version suitable only for bacterial-size genome assemblies. That module was deprecated with the release of CA version 7. Starting with CA version 7, the use of Mighty Meryl is required for all assemblies. Using the Mighty Meryl, the command 'meryl -V' returns a string like "meryl the Mighty Mer Counter version (no version)".
- merylMemory=integer (default=800)
- Amount of memory, in megabytes (MB), that meryl is allowed to use. Only applicable if compiled with kmer library. This is a per-process limit, not a per-thread limit. The limit applies to the process regardless of the number of threads.
- merylThreads=integer (default=1)
- Number of threads that meryl is allowed to use. Only applicable if compiled with the kmer library.
Fragment Error Correction
- frgCorrBatchSize=integer (default=200000)
- The number of reads to load into core at once. Fragment error correction will then scan the entire fragment store, recomputing overlaps. As a (very) rough guide, assume about 1.3GB for 100,000 Sanger reads.
- doFragmentCorrection=integer (default=1)
- If non-zero, do fragment error correction (and, implicitly, overlap error correction).
- frgCorrThreads=integer (default=2)
- The number of threads to use for fragment error correction.
- frgCorrConcurrency=integer (default=1)
- If the grid is NOT enabled, run this many fragment correction jobs at the same time.
- ovlCorrBatchSize=integer (default=200000)
- Documentation needed. As a rough guide, 1,000,000 uses about 2.5GB of memory and 400,000 uses about 750MB.
- ovlCorrConcurrency=integer (default=4)
- If the grid is NOT enabled, run this many overlap correction jobs at the same time.
- dncMPlibraries=string (default=undef)
- A comma-separated list of libraries to process using de-novo classification.
- dncBBlibraries=string (default=undef)
- A comma-separated list of libraries to use as evidence when using de-novo classification.
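The rough guide given for frgCorrBatchSize (about 1.3GB per 100,000 Sanger reads) can be turned into a quick estimate. The sketch below assumes memory scales linearly with batch size, which the text only implies, and the helper names are hypothetical. Treat the results as ballpark figures.

```python
GB_PER_100K_SANGER_READS = 1.3  # (very) rough figure quoted for frgCorrBatchSize


def frg_corr_memory_gb(batch_size):
    """Estimated memory, in GB, for one fragment correction batch."""
    return GB_PER_100K_SANGER_READS * batch_size / 100_000


def max_batch_for_memory(mem_gb):
    """Largest batch size whose estimate fits within mem_gb."""
    return int(mem_gb / GB_PER_100K_SANGER_READS * 100_000)


print(f"default batch of 200000 reads: about {frg_corr_memory_gb(200_000):.1f} GB")
print(f"batch size fitting in 4 GB: about {max_batch_for_memory(4)} reads")
```

Other read types may use memory differently, so these numbers apply, at best, to Sanger data.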
The unitigger module runs as step 4 of the assembly pipeline. The module uses the read and mate data plus the pair-wise overlaps calculated upstream in the pipeline. Unitigs are initial, high confidence contigs. A true unitig would have zero contradictions in the input data. In order to build large unitigs in the face of noisy data, Celera Assembler only attempts to build unitigs that have few contradictions in the input data.
- unitigger=utg or bog or bogart (default=utg)
- Celera Assembler offers three unitig modules but the pipeline can only run one. The original unitigger (UTG) is best for Sanger capillary data. The best overlap graph unitigger (BOG) is best for 454 pyrosequencing data alone or in combination with Sanger data. BOG is faster than UTG. The bogart unitigger (BOGART) is best for Illumina data alone or in combination with other data types. The default is utg, but the default is changed by flags in the input FRG files. For instance, BOG will run by default if the input FRG file contains even one LIB message with the attribute "forceBOGunitigger=1". The unitigger= flag, whether issued on the command line or in a spec file, will override the default and the FRG files.
- utgGenomeSize=integer (default=not-set)
- This option can be used to bias the labeling of unitigs as 'repeat' or 'unique'. The label is used by the downstream scaffold module. By default, effective genome size is derived by the unitigger based on read coverage in large unitigs. Then, a coverage expectation is developed based on genome size, number of reads, and average read length. Then, all unitigs are assigned a uniqueness score (the A-stat) based on actual coverage compared to expected. Use of this option is recommended only for situations where extreme coverage differences make it unlikely that the unitigger will produce a good genome size estimate. Assemblies of uniform data do not improve by the input of a 'more accurate' genome size. If supplied, smaller numbers bias unitigger towards labelling unitigs as unique, while larger numbers bias it towards labelling unitigs as repeat. The value used by unitigger, whether input or computed, can be discovered by searching for the word genome in the unitigger output. For example: grep -i genome myASM/4-unitigger/unitigger.err
- utgBubblePopping=integer (default=1)
- Zero means do not pop bubbles and one means do pop bubbles. One is the default but really there is no solid recommendation. Bubbles are alternate paths in the overlap graph. They offer an alternate set of reads for some small section of a unitig. They can be induced by diploid polymorphism, error-induced overlaps, or repeat-induced overlaps. Popping a bubble means to fold both paths into one by sequence alignment. Whether this is correct is very hard to know. The safest route is to turn off bubble popping. If the resulting assembly leaves out many reads as degenerates, and if the degenerate consensus sequences align to contigs in the assembly, then it is probably advantageous to turn bubble popping on. (If you do this, you can launch the second assembly without re-calculating overlaps. In the directory with the first run, delete the subdirectories 4-unitigger through 9-terminator. Turn on bubble popping in your spec file. Then re-launch runCA exactly as before. The second run will begin at the 4-unitigger stage.) One danger of bubble popping is goofy multiple-sequence alignments. For this reason, the post-unitig consensus module is more likely to quit on bubble-popped unitigs. Bubble popping has been a UTG unitigger option since 2002 and a BOG unitigger option since CA 6.1.
- utgRecalibrateGAR=integer (default=1)
- If one, recalibrate the global fragment arrival rate based on large unitigs.
- bogBreakAtIntersections=integer (default=1)
- Break unitigs at best overlap intersections.
- bogBadMateDepth=integer (default=7)
- Split unitigs with more than this number of overlapping bad mates.
- batRebuildRepeats=integer (default=0) (new in CA 7.0)
- If non-zero, enable EXPERIMENTAL repeat rebuilding in bogart.
- batMateExtension=integer (default=0) (new in CA 7.0)
- If non-zero, enable EXPERIMENTAL mate extension in bogart.
- batMemory=integer (default=undef) (new in CA 7.0)
- A memory limit for the BOGART unitig module. Units = gigabytes (GB). Default = use all memory available on the current host. This only controls the size of the overlap graph loaded into memory; the total process footprint is slightly larger. BOGART loads one overlap per read end. Then, based on remaining available memory, it adds some amount of additional overlaps per read. The additional overlaps are used for detection of unique/repeat boundaries along paths in the graph. Thus, larger memory limits can improve accuracy of repeat detection.
- batThreads=integer (default=undef) (new in CA 8.0)
- Enable parallel processing using this many processors on a shared memory server. Portions of the unitig process (currently the Merge/Split/Join) exploit multiple threads to reduce run time. If this parameter is explicitly set to zero, the program reverts to OpenMP behavior: thread count is determined by the Unix environment variable OMP_NUM_THREADS with all processors as the default.
- doUnitigSplitting=integer (default=1) (new in CA 8.0)
- After constructing unitigs (and their consensus sequence), search for a mate pair and read coverage pattern that indicates a unitig formed with a chimeric read. As of CA 8, with contemporary reads, this module seems to do more damage than good. Use is deprecated.
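The effect of utgGenomeSize on unique/repeat labeling can be illustrated with a Poisson log-odds score. The sketch below computes the natural-log odds that a unitig's read count arises from single-copy sequence rather than from a two-copy collapse, given a global read arrival rate of total reads divided by genome size. It is an assumption that this matches the general shape of the A-stat; the unitigger's exact computation may differ, and all names here are hypothetical.

```python
import math


def poisson_uniqueness_score(unitig_span_bp, reads_in_unitig,
                             total_reads, genome_size_bp):
    """Log-odds (natural log) of 'one copy' versus 'two copies collapsed'.

    With arrival rate r = total_reads / genome_size_bp, the ratio of the two
    Poisson likelihoods simplifies to exp(r * span) / 2**reads.
    """
    arrival_rate = total_reads / genome_size_bp
    return unitig_span_bp * arrival_rate - math.log(2) * reads_in_unitig


# A 10 kbp unitig in a data set of 1,000,000 reads.
for genome_size in (100_000_000, 50_000_000):
    for reads in (100, 200):
        score = poisson_uniqueness_score(10_000, reads, 1_000_000, genome_size)
        print(f"genome={genome_size} reads={reads} score={score:.1f}")
```

In this sketch a smaller genome size raises the expected coverage, which raises the score and pushes unitigs toward the unique label, in line with the utgGenomeSize description.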
The scaffold module is called CGW (chunk graph walker). It builds contigs and scaffolds from unitigs and mate pairs. If your assembly has no mate pairs, CGW will spend a few moments reading inputs, building data structures, writing output, and output contigs that are identical to the input unitigs.
CGW periodically dumps a checkpoint file (*.ckp.*); if stopped and restarted, CGW will always start from the last valid checkpoint.
A few portions of CGW are multithreaded. For many assemblies, these threaded portions are extremely fast. There is no runCA support for changing the number of threads. The OMP_NUM_THREADS environment variable can be set to the desired number of threads.
- cgwPurgeCheckpoints=integer (default=1)
- If non-zero, remove all but the final CGW checkpoint file after cgw finishes successfully.
- cgwCompressTigStore=integer (default=0) (new in CA 8.0)
- After finishing successfully, remove intermediate versions from the tigStore. This can free up substantial disk space.
- WARNING! At present, ALL previous versions are removed, including the outputs of the unitigger and consensus. If you need to restart CGW from the beginning, unitigs and consensus must be recomputed as well.
- cgwDemoteRBP=integer (default=1)
- The value 1 enables this test while all other values disable it. When enabled, the Repeat Branch Pattern (RBP) test is applied at CGW startup to all unitigs that are not already marked repeat. It demotes some unitigs to repeat. Repeat unitigs are withheld until late in the scaffold building process, cannot be used to seed a scaffold, but can be placed in multiple scaffolds. RBP will demote a unitig if: both edges have more than one path in the scaffold graph, counting only paths to large unitigs, and requiring at least one path at each end to be supported by a sequence overlap.
- cgwUseUnitigOverlaps=integer (default=0)
- EXPERIMENTAL! Load unused overlaps from the BOG unitigger into scaffolder. A single overlap and a single mate edge will be enough evidence to join unitigs into a contig.
- cgwReloadMates=integer (default=0)
- EXPERIMENTAL! Attempt to load more mate pairs from gkpStore on restarting CGW. This has no effect if there are no new mate pairs in the gkpStore.
- astatHighBound=integer (default=5)
- Unitigs with this Astat or higher are most likely to be considered unique during scaffold formation. CGW uses unique unitigs (U-unitigs) to seed contigs. CGW withholds repeat unitigs until late in the scaffolding process, when it attempts to place them in multiple scaffold positions. Uniqueness is determined several ways but the most important is A-stat, the Celera arrival rate statistic based on local vs. global read coverage. In theory, A-stat greater than zero indicates unique. Setting this threshold higher may increase scaffold accuracy by reducing the chance of building a chimer across a true repeat. Setting this threshold lower may increase scaffold completeness by adding true unique unitigs to the scaffold construction process. Negative values are ok. The empirical distribution of A-stat values can be evaluated using the reports in the CA 5-consensus-coverage directory. This flag may not apply to short unitigs (which usually are not marked unique) or toggled unitigs (whose uniqueness was set by the user).
- astatLowBound=integer (default=1)
- Unitigs whose Astat is above this threshold have a chance to be considered unique during scaffold formation. Unitigs whose Astat is above astatLowBound will be tested for scaffold incorporation during the Throwing Rocks phases of scaffold construction.
- stoneLevel=integer (default=2)
- The aggressiveness of stone throwing in the last iteration of cgw.
- cgwMergeFilterLevel=integer (default=1) (new in CA 8.0)
- When evaluating mate pairs to decide if two scaffolds should be merged, this controls how correct the mate pairs should be before proceeding to aligning overlapping contigs. It is both a computational filter (lower numbers result in more computation) and a quality filter (higher numbers should result in more correct scaffolds at the expense of not building large scaffolds).
- 0 - no filtering, every scaffold pair with mate evidence will have a merge attempted
- 1 - the current and historical default rules
- 2 - revised rules, currently suggested, but not the default
- 5 - very strict rules
- computeInsertSize=integer (default=enabled for small assemblies)
- This controls whether to estimate mate pair insert sizes based on scratch scaffolds. Valid values are 0= skip this step and 1= do this step; no setting means do this step only if the input contains fewer than 1 million reads. The new estimates overwrite the original values in the gatekeeper store. RunCA creates a 6-clonesize directory for this step. The estimation involves a lengthy computation of scratch scaffolds. For this reason, this step is recommended for small (bacterial) assemblies only. Larger assemblies can rely on other re-estimation methods: the unitig-based estimation in 5-consensus-insert-sizes; the scaffold-based estimation embodied in each 7-CGW scaffold computation; and flags in the FRG files that disable re-estimation on a per-library basis.
- cgwDistanceSampleSize=integer (default=100)
- Do not update insert size estimates unless there are this many mates to use as evidence.
- doResolveSurrogates=integer (default=1)
- If non-zero, resolve fragments in surrogates. A surrogate is a unitig that is suspected of being a collapsed repeat. A surrogate may be placed in one scaffold location, in many scaffold locations, or not at all. When placed, a surrogate extends the consensus of the target scaffold. By default, a surrogate contributes no read coverage to the target scaffold. Yes, scaffolds can have regions with consensus sequence but 0X read coverage. The Resolve Surrogates algorithm places some surrogate reads in scaffolds. To be resolved, a read must have exactly one placement that agrees with the surrogate layout and is supported by its mate (paired end) constraint. The scaffold consensus will change to reflect the resolved reads, where possible; thus, this algorithm may resolve the variation between instances of a genomic repeat.
- doExtendClearRanges=integer (default=2)
- Do this many rounds of ExtendClearRanges. Valid values are 0, 1, and 2; typically nothing is gained past 2 rounds. This algorithm enhances coverage and closes gaps by enlarging the clear ranges of neighboring reads to the extent supported by pair-wise alignment. This is a valid operation since the two sequences, and their co-location within a scaffold, constitute independent lines of evidence. Each round of ECR creates a 7-?-ECR subdirectory.
- extendClearRangesStepSize=integer (default=5000, or 1/8 the number of scaffolds, see-below)
- Run the ExtendClearRanges algorithm in batches of this many scaffolds. The default is the larger of 5000 or one eighth the number of scaffolds. Note that a cgw checkpoint and a gatekeeper store backup are saved for each batch; it is VERY easy to run out of disk space on a large assembly if the step size is too small.
- kickOutNonOvlContigs=integer (default=0) (New in CA 6.1)
- EXPERIMENTAL! By default, cgw will only look for overlaps between adjacent contigs when negative gap sizes (overlaps) are estimated from the paired-end information. If no overlap is found, a gap of -20 is inserted. If this option is non-zero, look for overlaps to the next contig over, skipping the immediate neighbor. If an overlap is found, the immediate neighbor contig will be removed from the current scaffold.
- doUnjiggleWhenMerging=integer (default=0) (new in CA 6.1) (renamed in CA 7.0)
- EXPERIMENTAL! When performing gap filling, contigs are moved apart if necessary to fit inserted unitigs. If the contigs are moved out of overlapping range they will not be merged. If this option is non-zero, revert the contig placement to the original value and search for an overlap. If an overlap is found, the reverted positions are accepted and the contigs merged.
- cgwContigShatterWeight=integer (default=0) (new in CA 7.0)
- EXPERIMENTAL! When starting from a checkpoint, for any contig connected to its scaffold by a link with less than cgwContigShatterWeight, remove it and place it into a singleton scaffold.
- cgwMergeMissingThreshold=integer (default=0) (new in CA 7.0)
- EXPERIMENTAL! When merging scaffolds, missing mates are those mates that should fall within the merged scaffold but do not. In metagenomics, this may not be the case for a conserved region within strains as the mates are missing because they are in a different strain. This is a value between 0 and 1 to specify the percentage of missing mates (relative to good mates) to ignore. A value of -1 means ignore all missing mates when merging.
- cgwMinMergeWeight=integer (default=2) (new in CA 8.0)
- Set the minimum edge weight that CGW should consider during merge scaffolds aggressive. This applies to merge edges in the graph of contigs and mate pairs. Merge edges are connections between whole scaffolds. The edge weight is the sum of mate pairs and sequence overlaps that support the connection. For each sufficiently weighted edge, CGW will test the merge and possibly implement it. After every round of productive merging, CGW starts again looping through untested merge edges. These loops use a lower bound that diminishes when merging is unproductive. The minimum of minimums is determined by this parameter. The value 2 is appropriate for typical Sanger mate coverage. Slightly higher values may be appropriate when mate coverage is very high. Higher values increase stringency and accuracy while reducing run time and output scaffold sizes.
- cgwPreserveConsensus=integer (default=0) (new in CA 8.0)
- Do not remove the contig consensus sequence at the end of scaffolder; ctgcns will be skipped (faster) but quality will be lower.
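The disk-space warning under extendClearRangesStepSize is easier to judge with numbers. The sketch below computes the default step size as described (the larger of 5000 or one eighth the number of scaffolds) and the resulting number of batches, each of which saves a cgw checkpoint and a gatekeeper store backup. The use of integer division and the function names are assumptions made for illustration.

```python
import math


def default_ecr_step_size(num_scaffolds):
    """Larger of 5000 or one eighth the number of scaffolds (integer division assumed)."""
    return max(5000, num_scaffolds // 8)


def ecr_batches(num_scaffolds, step_size=None):
    """Number of ExtendClearRanges batches, and so of saved checkpoints and backups."""
    if step_size is None:
        step_size = default_ecr_step_size(num_scaffolds)
    return math.ceil(num_scaffolds / step_size)


print(ecr_batches(1_000_000))        # 8 batches with the default step size
print(ecr_batches(1_000_000, 5000))  # 200 batches if the step size is forced to 5000
print(ecr_batches(20_000))           # 4 batches for a smaller assembly
```

Under these assumptions the default keeps a large assembly to roughly eight batches, while a small explicit step size multiplies the number of checkpoints and backups kept on disk.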
These options apply to both post-unitigger and post-scaffolder consensus.
- cnsPartitions=integer (default=128, see-below)
- The approximate number of partitions unitigger and scaffolder will generate for consensus. There will be no more than this, but likely will be fewer. The default is 128 partitions, or partitions consisting of about cnsMinFrags fragments, whichever results in fewer partitions.
- cnsMinFrags=integer (default=75000)
- The minimum number of fragments in a consensus partition.
- cnsConcurrency=integer (default=2)
- If the grid is not enabled, run this many consensus jobs at the same time.
- cnsPhasing=integer (default=0) (New in CA 6.1)
- If this option is non-zero, VAR records on contigs will be phased. If two VARs are spanned by a read and phasing is enabled, consensus will call consistent VARs as primary, even if one of them has less read support. This option is not recommended for assemblies containing next generation sequencing data or assemblies of clonal DNA samples.
- cnsReduceUnitigs=integer integer (default=100 5) (New in CA 8.0) (Removed in CA 8.2)
- Set parameters for when and how to sample down unitig coverage. By default, for unitigs with more than 100x coverage in which at least 5% of the bases in reads come from non-contained reads, only the non-contained reads are used to generate the consensus sequence.
- Set this to '0 0' to always use non-contained reads, or to '100 100' to disable the sampling.
- cnsMaxCoverage=float (default=0) (New in CA 8.2)
- Limit read coverage during unitig consensus. Contained reads are randomly removed until the coverage limit is met.
- Default 0 means unlimited coverage.
- Note that uncontained reads cannot be removed (the unitig will be disconnected) and this can result in unitigs that exceed the coverage limit.
- cnsReuseUnitigs=integer (default=0) (New in CA 8.0)
- For contigs that are formed from a single unitig, do not recompute the consensus sequence in stage 8-consensus. Instead, reuse the 5-consensus result. Note that VAR records will not be computed for these contigs.
- Future use. Select which consensus stage to use.
- fakeUIDs=integer (default=0)
- If zero, use real UID's from the UID server. Otherwise, use UID's starting from this value.
- uidServer=string (default=empty-string)
- Pass this string to stages that access the UID server (currently, AS_TER/terminator and AS_OBT/dumpDistanceEstimates).
- createAGP=integer (default=0)
- If non-zero, create an AGP file for the scaffolds. As of 8/2012, the output does not conform to the NCBI AGP 2.0 spec.
- createACE=integer (default=0)
- If non-zero, create an ACE file for the scaffolds. The output file may not be acceptable to CONSED.
- createPosMap=integer (default=1)
- If non-zero, create the posmap files that map fragments, contigs, variation records, etc, with contig and scaffold coordinates.
- merQC=integer (default=0)
- If non-zero, compute a mer based QC report.
- merQCmemory=integer (default=1024)
- Amount of memory to use, in megabytes, when computing the merQC.
- merQCmerSize=integer (default=22)
- The mer size (k) to use for the merQC.
- cleanup=none or light or heavy or aggressive (default=none)
- Remove temporary/intermediate files after the assembly finishes. Valid values are 'none' (no cleanup), 'light' (temporary files), 'heavy' (currently, same as light), 'aggressive' (everything except the output is removed).
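The interaction between cnsPartitions and cnsMinFrags, described in the consensus options above, can be sketched as a small calculation. This is one reading of the description (the partition count is capped at cnsPartitions, and partitions hold at least about cnsMinFrags fragments); the actual partitioning code may round differently, and the function name is hypothetical.

```python
def consensus_partitions(total_frags, cns_partitions=128, cns_min_frags=75000):
    """Approximate number of consensus partitions for an assembly."""
    # Partitions needed if each one holds about cns_min_frags fragments.
    by_min_frags = max(1, total_frags // cns_min_frags)
    # Whichever rule yields fewer partitions wins.
    return min(cns_partitions, by_min_frags)


for frags in (50_000, 1_000_000, 20_000_000):
    print(f"{frags} fragments: about {consensus_partitions(frags)} partition(s)")
```

Under this reading, only assemblies with more than about 9.6 million fragments (128 x 75,000) reach the full 128 partitions.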
Unitig Repeat/Unique Toggling
Celera Assembler uses Poisson statistics to categorize unitigs on a scale from unique to repetitive. Due to coverage bias or cutoff effects, this classification may occasionally mark a unique unitig as a repeat. To avoid misassembly, the repeat unitigs are not trusted in the assembly, so incorrectly marking unique elements as repeats may lower assembly contiguity. Unitig Repeat/Unique Toggling allows Celera Assembler to correct these "repeats". The default behavior when doToggle=1 is supplied to runCA is to search for all surrogates which are placed only once and are at least 2000bp. They are toggled to be unique and re-assembled. Toggling creates a 10-toggledAsm directory and places the assembly results inside it.
- doToggle=integer (default=0)
- If non-zero, at the end of a successful assembly, search for placed surrogates and toggle them to be unique unitigs, then re-run the assembly starting from scaffolder.
- toggleUnitigLength=integer (default=2000)
- Minimum length for a surrogate to be toggled.
- toggleNumInstances=integer (default=1) (New in CA 6.1)
- Number of instances for a surrogate to be toggled. If 0 is specified all non-singleton unitigs over toggleUnitigLength bp are toggled to unique.
- toggleMaxDistance=integer (default = 1000) (New in CA 6.1)
- The maximum distance from a scaffold end that a surrogate unitig may appear to be considered for toggling. If a surrogate appears twice, both times within toggleMaxDistance bp of the scaffold end, the surrogate is toggled to be unique.
- toggleDoNotDemote=integer (default = 0) (New in CA 6.1)
- Strictly enforce the unique/repeat toggle classification. If this option is non-zero, cgw will never change the unique/repeat designation specified by toggling. Otherwise, demote unitigs with repeat branching patterns (see cgwDemoteRBP option in the Scaffolder section).
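The default selection rule (surrogates placed only once and at least 2000bp long) can be sketched as a simple filter. The Surrogate record and function below are hypothetical stand-ins, not Celera Assembler data structures. The sketch treats toggleNumInstances as an exact placement count, which is an assumption, and it does not model toggleNumInstances=0 or toggleMaxDistance.

```python
from dataclasses import dataclass


@dataclass
class Surrogate:
    unitig_id: int
    length_bp: int
    placements: int  # number of scaffold positions where the surrogate was placed


def select_for_toggling(surrogates, toggle_unitig_length=2000,
                        toggle_num_instances=1):
    """Return the IDs of surrogates that would be toggled to unique."""
    if toggle_num_instances < 1:
        raise ValueError("this sketch only handles toggle_num_instances >= 1")
    return [
        s.unitig_id
        for s in surrogates
        if s.length_bp >= toggle_unitig_length
        and s.placements == toggle_num_instances
    ]


candidates = [
    Surrogate(unitig_id=1, length_bp=3500, placements=1),  # long, placed once
    Surrogate(unitig_id=2, length_bp=1200, placements=1),  # too short
    Surrogate(unitig_id=3, length_bp=5000, placements=3),  # placed several times
    Surrogate(unitig_id=4, length_bp=2000, placements=1),  # exactly at the minimum
]
print(select_for_toggling(candidates))  # [1, 4]
```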
Inside the assembly directory, the final results will be found in the 9-terminator sub-directory.
The primary output file is the prefix.asm file. It offers a precise description of the assembly as a hierarchical data structure. It does not repeat the input read sequences, but it does contain all the generated contig and scaffold sequences. It is in a format specific to Celera Assembler. See the ASM guide and the ASM spec.
A Quality Assessment (or Quality Control) report is in the prefix.qc file. It contains over 100 statistics about the assembly.
A few other files can help a QC investigation. The Overlap Based Trimming stage generates two files of interest when quality checking an assembly. A summary of chimera and spur detection/removal/fixing is in 0-overlaptrim/prefix.chimera.summary, and the gory details (including UIDs) are in 0-overlaptrim/prefix.chimera.report. The before/after trimming results are in 0-overlaptrim/prefix.mergeLog; the columns are uid, iid, original clear, new clear, and a free-form text annotation of whether the fragment was deleted and why.