Utilities

From wgs-assembler
Jump to: navigation, search

The Celera Assembler includes several utility programs. These are found in the wgs/platform/bin directory after a build (where wgs is the CA top-level directory, and platform is something like Linux.)

fragmentDepth

Generate a histogram of read depth per sliding window along a scaffold. Input file is one of the Celera Assembler output files: the sorted list of read-start postions called <prefix>.posmap.frgscf.

usage: ./fragmentDepth MODE [-min N] [-max N] [-stepSize N] < x.posmap.frgscf

  -min N     use scaffolds at least N bases long.
  -max N     use scaffolds at most N bases long.

MODES:  -histogram, -scaffold or -depth

The default mode is to compute a histogram of the number of bases at some
depth of coverage.

The -scaffold mode reports the mode, mean, median depth per scaffold.  The
-stepSize option will compute those stats, in blocks of N bases (e.g., for bases
0 through N, then N through 2N, then 2N through 3N, etc.)

The -depth mode writes a multi-fasta file with the actual depth at each base
encoded.  The encoding is somewhat complicated to avoid using the '>' letter.
Depth 0 through 9 is encoded as '0' through '9'.  Depth 10 through 68 is
encoded as A-Z[\]^_`a-z{|}, and depth more than 68 is encoded as ~.  Decode as:
  depth = letter - '0';
  if (depth > 9)
    depth -= 7;

!!WARNING -- The input frgscf MUST be sorted by scaffold ID -- WARNING!!

computeCoverageStat

Calculate coverage statistics that Celera Assembler uses to label a unitig as unique or collapsed repeat. The per-unitig statistics are: read arrival rate; average read coverage; the A-stat. The global statistics are: global read coverage; predicted genome size.

usage: ./computeCoverageStat -g gkpStore -t tigStore version

  -g <G>     Mandatory, path G to a gkpStore directory.
  -t <T> <v> Mandatory, path T to a tigStore, and version V.
  -s <S>     Optional, assume genome size S.
  -o <name>  Recommended, prefix for output files.
  -n         Do not update the tigStore (default = do update).
  -u         Skip the N50-based computation.

The inputs are two of Celera Assembler's internal databases: G="<your_project>.gkpStore" and T="<your_project>.tigStore". The outputs include a report on global statistics including genome size. The outputs also include a report on the new A-stat value for each unitig. Optionally, the utility updates the unitigs in the tigStore database.

The required tigStore version is a positive integer corresponding to filenames within the tigStore directory. Version 1 refers to initial unitigs before splitting and before consensus. The highest number found in the tigStore directory corresponds to the latest version of contigs in scaffolds.

The -s parameter sets genome size explicitly. When set, the given genome size, rather than the computed genome size, becomes the basis for setting A-stat on all unitigs.

The output is several files sharing the filename prefix specified by the -o parameter.

  • <prefix>.cga.0 is Celera Assembler's historical report of stats binned by unitig size and read count.
  • <prefix>.log is a list of statistics, one line per unitig.
  • <prefix>.stats is a short list of genome-wide statistics.

The statistics include...

  • Rho = In one unitig, the length from start of first read to start of last read.
  • sumRho = The sum of rho overal all unitigs.
  • Arrival rate = Average number of reads starting at each base. (Reciprocal of number of bases between read starts.)
  • RandomFrags = Number of reads in unitigs, minus reads marked non-random in gkpStore.
  • Supplied genome size = Echo the genome size parameter, if user supplied one.
  • Computed genome size = Extrapolation based on observed coverage and number of reads.

The utility computes the arrival rate (arrival rate = number of read starts / number of bases in unitigs). In all cases, it shows the estimated genome size based on the arrival rate (genome size = total reads / arrival rate). The number of reads is adjusted to reflect only reads from random libraries and to omit the last read in every unitig. The number of bases is adjusted to omit the contribution from the last read in every unitig.

The utility encodes several models. It runs them in order and uses the last one that worked.

  1. Global method: compute the arrival rate across all unitigs and all reads.
  2. N50 method: compute the arrival rate across unitigs longer than the N50. The N50 is based on unitig span. This is new (July 18 2012).
  3. Big unitig method: use unitigs longer than 10Kbp if those unitigs cover at least half the sequence (e.g. if N50 >= 10Kbp). Split these unitigs into 10Kbp spans and measure arrival rate on all such spans. Use heuristics to analyze the distribution of arrival rates and choose a representative.

Known Issues

July 18 2012 utgGenomeSize ignored: The runCA parameter utgGenomeSize should override the genome size estimate based on arrival rate. The parameter may have been ignored in code versions since CA 7; this is fixed.

July 18 2012 poor genome size estimate: Poor genome size estimates have been seen lately. One solution is to try the new version which uses the N50 method, rather than the global method, when unitigs are small. (As before, the utility uses the big unitig method only when unitig N50 > 10Kbp.) A work-around is to set the runCA parameter utgGenomeSize using the version in the CVS tip.

July 18 2012 crash: The utility could crash when run stand-alone outside of runCA under certain circumstances. Setting the -n parameter (meaning: do not update tigStore) could lead to a crash; this is fixed. In incorrect tigStore version can lead to a crash with a misleading message, yet choosing the correct value can be tricky; not yet fixed.