# Celera Assembler Theory

# Repeat Detection

Celera Assembler discriminates unique unitigs (U-unitigs) from repeat unitigs. The former are promoted to contigs and thereby seed scaffolds. The latter are withheld till late in scaffold formation, when they are placed in zero to many scaffold locations.

Celera Assembler uses the following repeat detection methods:

- Overlap patterns. The bogart unitig module declares unitigs as unique or repeat based on patterns of overlaps to reads that are not in the unitig. Piles of such overlaps indicate unique/repeat boundaries. Within such piles, asymmetry of unaligned read ends is used to indicate which side of a boundary is unique.
- Length. Short unitigs are not trusted to be unique.
- Coverage statistics. CA uses arrival rates (read starts per consensus base) to assess coverage. The A-stat compares the unitig's arrival rate to the global arrival rate, as inferred from the largest unitigs. When the local arrival rate is significantly higher than the global arrival rate, the unitig is considered a repeat.
- Microhet statistics. CA calculates whether the unitig is likely to have collapsed repeat copies, based on rates of conflicting base calls within aligned columns. The name refers to micro-heterogeneity; the calculation resembles SNP or heterozygosity detection. As of summer 2012, this method is disabled by default.
- Graph topology. RBP stands for Repeat Branch Pattern. RBP examines the local graph of this unitig and all the unitigs connected to it by sequence overlaps.
- User intervention. Users can apply a Force-Unique label or a Force-Repeat label to unitigs by ID or by criteria such as size. One tool for doing so is the runCA doToggle option.

## Coverage-based repeat discrimination

Read coverage is one method of detecting a collapsed repeat during assembly. (Another uses graph topology or network flow.) For every unitig it assembles, Celera Assembler applies a statistic based on read coverage. The A-stat was described in footnote 24 of the Drosophila assembly paper (Myers, Sutton, et al., Science, 2000):

Suppose there are F fragments in a database and the genome size is estimated as G. For a unitig with k fragments and distance ρ between the start of its first fragment and the start of its last fragment, the probability of seeing the k − 1 start points in the interval of length ρ, given the unitig is not oversampled, is [(ρF/G)^k / k!] exp(−ρF/G). If the unitig was the result of collapsing two repeats, then the probability is [(2ρF/G)^k / k!] exp(−2ρF/G). The log of the ratio of these two probabilities, (log e)(ρF/G) − (log 2)k, is our A statistic.

A-stat is the log of the ratio of two probabilities.

A-stat (unitig X) = log [ Prob (X is single-copy | coverage (X)) / Prob (X is two-copy | coverage (X)) ]

Thus...

- Positive A-stat indicates uniqueness. Negative A-stat indicates repetitiveness. Zero indicates uncertainty.
- A unitig with an A-stat of 1 is 10X more likely to be unique than to be a 2-copy repeat in the genome.
- Celera Assembler considers a unitig with A-stat >= 5 to be unique. The unitig seeds a scaffold.
- Celera Assembler considers a unitig with A-stat between 1 and 5 to be possibly unique. The unitig will be tested for "branch pattern repeat" within the scaffold graph.
- Celera Assembler is suspicious of unitigs with A-stat < 1. These may be used as surrogates to fill scaffold gaps or left out of the scaffolds as degenerates.
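The arithmetic above can be sketched in a few lines of Python. This is a minimal illustration of the formula and thresholds, not CA's computeCoverageStat code; the function and parameter names are invented here:

```python
import math

def a_stat(k, rho, F, G):
    """A-stat (base-10 log) for a unitig with k fragments whose first
    and last fragments start rho bases apart, given F fragments in the
    database and an estimated genome size G."""
    c = rho * F / G                       # expected read starts in rho
    return c * math.log10(math.e) - k * math.log10(2)

def classify(astat):
    """CA's thresholds: >= 5 unique, 1..5 possibly unique, < 1 suspicious."""
    if astat >= 5:
        return "unique"
    if astat >= 1:
        return "possibly unique"
    return "suspicious"
```

A unitig sampled at roughly the global arrival rate scores positive; one with about twice the expected read starts scores negative, as the formula predicts for a collapsed 2-copy repeat.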

### A-stat Implementation

A-stat is reported in the deflines of the *.utg.fasta output file. A-stat is stored in the tigStore database. A-stat is reported in the *.asm output file. Prior to CA 7.1, A-stat was calculated during the 4-unitig stage. Currently, A-stat is calculated during the 5-consensus-coverage stage. A-stat is computed by the computeCoverageStat module (run it without options to get a usage statement).

## Graph-based repeat discrimination

RBP stands for Repeat Branch Pattern. CA applies the RBP test to all unitigs immediately after graph construction and before contig construction. RBP examines the local graph of this unitig and all the unitigs connected to it by sequence overlaps. The unitig is considered unique if this analysis supports one and only one path through this unitig. The neighbor unitigs must be above a minimum size and the edges above a minimum weight.

### RBP Implementation

CA applies RBP to unitigs. RBP "demotes" some of them from unique to repeat status. CA applies RBP in the 7-0-CGW stage of scaffolding, after building a graph where unitigs constitute nodes and mate bundles constitute edges, and before constructing contigs from unitigs. Specifically, CGW calls DemoteUnitigsWithRBP() just before BuildInitialContigs(). A command-line option controls whether RBP is applied at all. RBP is not applied to a unitig that has been marked AS_FORCED_REPEAT, such as by toggling. RBP demotes a unitig from unique status when both unitig ends have more than one qualifying overlap. To qualify, an overlap must be supported by MIN_EDGES-1 mates plus one sequence overlap, and it must connect to a unitig of at least 1500 bases.
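The demotion rule can be sketched as follows. This is a hypothetical reading of the rule above, not the CGW source; the Overlap type, its field names, and the MIN_EDGES value of 2 are assumptions:

```python
from dataclasses import dataclass

MIN_EDGES = 2            # assumed value of CA's MIN_EDGES constant
MIN_NEIGHBOR_LEN = 1500  # minimum size of the connected unitig

@dataclass
class Overlap:
    mate_support: int    # mates bundled with this overlap
    neighbor_len: int    # length of the unitig it connects to

def qualifies(o):
    """An overlap qualifies if it is backed by MIN_EDGES-1 mates plus
    the sequence overlap itself and reaches a large-enough unitig."""
    return o.mate_support >= MIN_EDGES - 1 and o.neighbor_len >= MIN_NEIGHBOR_LEN

def demote_by_rbp(left_overlaps, right_overlaps):
    """Demote from unique to repeat when BOTH ends have more than one
    qualifying overlap, i.e. the path through the unitig branches."""
    return (sum(1 for o in left_overlaps if qualifies(o)) > 1 and
            sum(1 for o in right_overlaps if qualifies(o)) > 1)
```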

# Scaffold Construction

For a set of contigs, scaffolds define

- linear order (predicting order on some chromosome)
- relative orientation (corresponding to DNA strand)
- gap sizes (spacing between consecutive contigs).

Scaffolds are constructed from contigs plus mate pairs. Mate pairs give the relative orientation and spacing of two reads. The spacing is specified by a mean and standard deviation derived from the population of mate pairs generated from a single library. When the two reads of a mate pair occur in different contigs, the mate pair suggests the relative orientation and separation of the two contigs. A collection of mate pairs may offer contradictory signals. Thus, an individual mate pair within a scaffold may span part of a contig that contains both reads, a gap between contigs that contain the reads, or multiple gaps and contigs.
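To make the geometry concrete, here is a minimal sketch (invented names, not CA code) of how one mate pair anchored in two adjacent contigs implies a gap size:

```python
def implied_gap(insert_mean, a_tail, b_head):
    """Gap suggested by a single mate pair: the library's mean insert
    size minus the sequence the two contigs already account for.
    a_tail: bases from the first read's start to the end of contig A;
    b_head: bases from the start of contig B to the second read's end."""
    return insert_mean - a_tail - b_head
```

With a 2000 bp library, reads sitting 600 bp and 900 bp inside their contigs suggest a 500 bp gap; the library's standard deviation says how firmly.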

## CGW Scaffold Algorithm

Celera Assembler approaches the scaffold problem like so:

- Start with the graph of nodes representing individual unitigs and edges representing groups of consistent mate pairs.
- During initial scaffold building, apply transitive reduction to the graph. Infer edges implied by others. Build linear scaffolds. Break at path divergence. Estimate each gap size independently using only the mate pairs that connect the one upstream contig to the one downstream contig.
- To initiate scaffold merging, apply greedy heuristics to the graph. Select a pair of scaffolds to merge using the inter-scaffold edge of highest weight. Implicitly, this prefers joining the largest contigs, those that are closest on the genome, and those linked by the small-separation mates.
- To begin a particular merge, construct a preliminary contig layout from the order, orientation, and gap sizes of the component scaffolds.
- Select mates to use as the observations. Prune the mates that are mis-oriented in the merged layout. Prune the mates whose distance makes them outliers within their library; this is to preclude their dominating effect on least squares.
- Apply least squares to choose the optimal size for all gaps simultaneously. First decide the layout for the merged scaffold; least squares will not change this. Formulate the gap sizes as the coefficients to be found and the mates as the observations. The system is over-specified so long as there are multiple mates per gap. Mates contribute to all the gaps they span.
- If least squares fails, iterate the process of mate inclusion and least squares calculation. Relax the inclusion criteria if the proposed scaffold becomes disconnected. If necessary, relax until all mates are used. Finally, if no solution is possible, either back out of the merge or estimate gap sizes in a greedy rather than optimal way.
- At negative gaps, attempt to assemble consecutive contigs by sequence alignment. Introduce an artificial gap (-20bp) if sequence alignment fails.
- Evaluate the merger by comparing mate satisfaction rates before and after the merge. Back out if necessary.
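The mate-selection step above (pruning mis-oriented and outlier mates before least squares) might look like this sketch; the record fields and the 3-sigma cutoff are assumptions, not CA's actual criteria:

```python
def trusted_mates(mates, max_sigma=3.0):
    """Keep only mates that are correctly oriented in the proposed
    layout and whose observed separation is within max_sigma standard
    deviations of their library mean, so a few bad mates cannot
    dominate the least-squares fit."""
    keep = []
    for m in mates:
        if not m["oriented_ok"]:
            continue                      # mis-oriented in merged layout
        z = abs(m["observed"] - m["mean"]) / m["stddev"]
        if z <= max_sigma:
            keep.append(m)
    return keep
```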

### CGW Scaffold Implementation

- In code, the edges selected to support a merge are called Trusted Edges.
- Summer 2012: it is not possible to back out of a scaffold merge if the merge test invoked a contig merge.
- Summer 2012: CGW does not recover elegantly from least squares failures. Failure is rare but it can lead to poor gap sizes or worse: large chimeric scaffolds with gigantic gaps.

## Gap Estimation by Least Squares

Gap sizes are computed from the contig layout plus the mate pair collection. For every gap, assume there is one size that will minimize the stretch and compression of all mates that span the gap. (Spanning mates may be anchored in the two contigs adjacent to a gap. Others may span whole contigs and multiple gaps.) For the scaffold as a whole, assume there is a vector of gap sizes that minimizes the stretch and compression of all mates in the scaffold. The stretch and compression is computed by evaluation of the mate pairs. By virtue of its library mean and variance, each mate pair has an expected separation. By virtue of its placement in the proposed scaffold, each mate pair has an observed separation. Each mate separation can be computed exactly from the sum of the lengths of the spanned scaffold elements: whole gaps, partial contigs, and whole contigs.

The optimum vector of gap sizes can be estimated by least squares using the mate pairs. Other possible methods include maximum likelihood or weights and springs. Least squares minimizes the sum of the squares of the normalized residuals. The residual is the difference between expected and observed. The square of the residual is divided by the variance (standard deviation squared) to normalize it. Since least squares is sensitive to outliers, it helps to compute it after removing outliers such as very-stretched mate pairs. Since the model cannot handle incoherent data, mate pairs at the wrong orientation must be ignored.

Assume the sum-of-squares is well-behaved with respect to changing gap sizes. For each gap, express the partial derivative of the sum-of-squares with respect to changing gap size. The derivative is zero where the optimal gap size minimizes residual across one gap. There are N equations in N unknowns representing the N gaps. Solving the system of equations gives the optimal N gap sizes that minimize residual across the scaffold. The system is expressed in matrix form and solved by standard linear algebra.
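The derivation above can be illustrated with a toy solver. This is a from-scratch sketch of the normal equations for gap sizes, with an invented input layout and plain Gaussian elimination rather than CA's banded LAPACK calls:

```python
def solve_gaps(mates, n_gaps):
    """mates: list of (mean, stddev, fixed, gap_indices), where 'fixed'
    is the contig sequence a mate spans and gap_indices the gaps it
    crosses.  Returns the least-squares gap sizes by building and
    solving the normal equations A x = b."""
    A = [[0.0] * n_gaps for _ in range(n_gaps)]
    b = [0.0] * n_gaps
    for mean, sd, fixed, gaps in mates:
        w = 1.0 / (sd * sd)               # weight = inverse variance
        for g in gaps:
            b[g] += w * (mean - fixed)
            for h in gaps:
                A[g][h] += w
    # Gaussian elimination (no pivoting; fine for this toy system)
    for i in range(n_gaps):
        for j in range(i + 1, n_gaps):
            f = A[j][i] / A[i][i]
            for k in range(i, n_gaps):
                A[j][k] -= f * A[i][k]
            b[j] -= f * b[i]
    x = [0.0] * n_gaps
    for i in reversed(range(n_gaps)):
        s = sum(A[i][k] * x[k] for k in range(i + 1, n_gaps))
        x[i] = (b[i] - s) / A[i][i]
    return x
```

For example, two mates anchoring the individual gaps at 200 and 250 bases, plus a long mate spanning both gaps, yield a consistent solution; if the long mate disagreed, the weighted fit would split the difference according to the library variances.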

### Least Squares Implementation

The linear algebra is implemented with the LAPACK library. The library can optionally use a standard speed optimization involving Cholesky decomposition.

Celera Assembler's order of operations is:

- DPBTRF - compute the Cholesky factorization of a real symmetric positive definite band matrix A.
- DPBTRS - solve a system of linear equations A*X = B with a symmetric positive definite band matrix A, using the Cholesky factorization A = U**T*U or A = L*L**T computed by DPBTRF.
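What the two routines do can be demonstrated with a dense (un-banded) pure-Python sketch; real code would call LAPACK's banded versions listed above:

```python
import math

def cholesky(A):
    """Factor a symmetric positive definite matrix A = L * L^T
    (what DPBTRF does, minus the band storage)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                # raises ValueError if A is not positive definite
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def cho_solve(L, b):
    """Solve A x = b given A = L L^T: forward substitution then
    back substitution (what DPBTRS does)."""
    n = len(L)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x
```

When A is not positive definite, the square root in the factorization receives a negative argument and fails; this is the same failure mode the sba FAQ below describes for dpotrf.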

The Cholesky factorization fails under some conditions, as described in this excerpt from the sba FAQ:

Q20 -- Why am I getting a LAPACK error about a failed factorization in sba_Axb_Chol()?

The full message looks like this: "LAPACK error: the leading minor of order XX is not positive definite, the factorization could not be completed for dpotf2/dpotrf in sba_Axb_Chol()". This means that the Cholesky-based solver for the augmented normal equations fails at some iteration. In theory, the matrix of the system being solved should be positive definite, thus Cholesky should always work. Due to numerical error, however, sometimes this is not true and the Cholesky factorization cannot be computed. Nevertheless, no harm is done, since when an iteration fails, the damping of the normal equations is increased so that in a future iteration their matrix will be positive definite. In other words, this error does not affect convergence. To prevent sba from complaining, comment out the calls to sba_Axb_Chol() in sba_levmar.c and uncomment the ones to sba_Axb_LU(). These changes will result in using the LU decomposition in place of Cholesky for solving the normal equations and will get rid of the warning. Note, however, that LU decomposition takes longer to compute, so the modifications will render sba a bit slower.

In case of failure, the solution is to use a slower method. This fallback is not yet implemented in Celera Assembler.