Celera Assembler Terminology

An Assembly

  • An assembly is a set of scaffolds computed from reads.
  • A scaffold is an ordered and oriented set of one or more contigs with distances assigned to the gaps between contigs. In practice, each gap distance is computed from mate pairs that are anchored in neighbor contigs and span the gap. A scaffold implies a single sequence that possibly includes gaps.
  • A contig consists of a set of reads, a layout that includes all the reads and leaves no gaps, a multiple sequence alignment of the reads, and a consensus sequence. In practice, contigs consist of one or more unitigs. Note that the consensus may contain (small) gaps spanned by reads even though the layout includes no (0X) gaps.
  • A unitig is a special kind of contig. Ideally, it is fully consistent with all the data including reads, overlaps, and mate constraints. In practice, unitigs can only be consistent with most of the data. Conceptually, a unitig is a high-confidence contig. Maximal unitigs should contain either (1) unique sequence up to repeat boundaries, with less than a read-length of repeat on each end, or (2) nearly the full extent of a genomic repeat.
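The containment hierarchy described above (assembly, scaffold, contig, unitig, read) can be sketched as a small data model. This is only an illustration of the terminology, not Celera Assembler's internal representation; all class and field names (`Unitig`, `Contig`, `PlacedContig`, `Scaffold`, `Assembly`, `gap_after`, `reverse`) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Unitig:
    read_ids: List[str]          # reads laid out in this unitig


@dataclass
class Contig:
    unitigs: List[Unitig]        # in practice, one or more unitigs
    consensus: str               # consensus sequence of the contig


@dataclass
class PlacedContig:
    contig: Contig
    reverse: bool                # orientation of the contig within the scaffold
    gap_after: Optional[int]     # estimated gap to the next contig; None for the last one


@dataclass
class Scaffold:
    placed: List[PlacedContig]   # ordered and oriented contigs


@dataclass
class Assembly:
    scaffolds: List[Scaffold]

    def read_ids(self) -> List[str]:
        """Collect every read ID reachable through scaffolds, contigs, and unitigs."""
        return [r
                for s in self.scaffolds
                for p in s.placed
                for u in p.contig.unitigs
                for r in u.read_ids]


# Toy example (made-up reads and sequences): one scaffold with two contigs.
c1 = Contig([Unitig(["r1", "r2"]), Unitig(["r3"])], "ACGTACGT")
c2 = Contig([Unitig(["r4", "r5"])], "GGCC")
assembly = Assembly([Scaffold([PlacedContig(c1, False, 150),
                               PlacedContig(c2, True, None)])])
print(assembly.read_ids())
```

Orientation and gap distance are stored on the placement rather than on the contig, since they are properties of a contig's position in a scaffold.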

A Scaffold with a Surrogate

The Celera Assembler works with fragmentary sequences, their detected overlaps, and their given mate pairs. Often, the data are mutually contradictory, as shown here. Yet, Celera Assembler reduces the data to a linear sequence whenever that is justified.

(A) Sequence overlaps and mate pairs suggest several possible joins. Line segments represent fragments, vertical stacking represents overlaps, rectangles represent contigs, arrows represent links, and every element's thickness correlates to the amount of supporting data.

(B) The assembler reduces the graph such that one contradiction remains. The sequence fragments were reduced to contigs based on overlaps. The mate pairs were reduced to contig links of various weights. Here, three contigs form a linear scaffold but the fourth contig is problematic.

(C) The assembler has reduced the graph to a linear sequence. Its final step was to insert the fourth contig twice. Called a multiply placed surrogate unitig, the fourth contig appears to represent an over-collapse of fragments induced by a near-perfect repeat in the genome.

Glossary

Assembly
(1) A layout and associated consensus sequence(s) and/or multi-alignment(s). In other words, we use this term to speak of a tentative reconstruction of segments of the target sequence and the locations from which the reads were sampled.
Branch Point
(1) A branch point is a position on a fragment and/or chunk that is known to represent the boundary of a repetitive element. The inference one would like to make is that one side of the branch point is unique sequence and the other is repetitive, but internal repeat boundaries of micro- and mini-satellites are also detected as branch points.
Consensus Sequence (or simply Consensus)
(1) Given a collection of overlapping reads that do not precisely match along their overlaps, a consensus sequence for the collection is, loosely speaking, one's best guess at the sequence the reads were sampled from. Often people mean something more precise: the mathematical definition of a consensus sequence is one for which the sum of the differences between the consensus sequence and each one of the reads is minimal.
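As a rough illustration of the "minimal sum of differences" idea, the sketch below takes a multi-alignment that is already fixed and calls each column by majority vote. For a fixed alignment, the majority character in each column minimizes the number of mismatches against the reads in that column. This is not Celera Assembler's consensus algorithm; the function name `column_consensus`, the use of a space for "not covered" and a dash for an alignment gap, and the toy reads are all assumptions made for the example.

```python
from collections import Counter


def column_consensus(rows):
    """Majority-vote consensus over a fixed multi-alignment.

    Each row is a string in which ' ' means the read does not cover
    the column and '-' means an alignment gap inside the read.
    """
    width = max(len(r) for r in rows)
    padded = [r.ljust(width) for r in rows]
    out = []
    for col in zip(*padded):
        votes = Counter(c for c in col if c != " ")
        if not votes:
            continue  # no read covers this column
        # Ties are broken by character order, which is an arbitrary choice.
        base, _ = max(sorted(votes.items()), key=lambda kv: kv[1])
        if base != "-":
            out.append(base)  # a winning gap contributes no base
    return "".join(out)


rows = [
    "ACGTTGCA    ",
    "  GTAGCATT  ",
    "    TGCATTAG",
]
print(column_consensus(rows))  # ACGTTGCATTAG
```

In the fifth column the reads disagree (T, A, T), and the vote picks T. A column in which most reads carry a dash is dropped from the consensus, which is how an insertion present in only a minority of reads disappears.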
Contig
(1) A maximal set of reads in a layout which in aggregate cover a contiguous interval.
(2) A contiguous join of unitigs. It consists of a multiple sequence alignment of reads plus a consensus sequence, although it also has an internal unitig structure. The consensus can have short gaps representing inserts in a minority of the underlying reads. The consensus can have regions of 0X read coverage when the consensus is due to a surrogate.
Degenerate
(1) A unitig that could not be combined into any scaffold. It is like a singleton but it has more than one read. Degenerates sometimes contain high-copy plasmid sequence. Degenerates can reflect biological phenomena that undermine the assumptions of Celera Assembler's mathematical model.
Fragment
(1) Either a guide or a read. Unfortunately this term has a long history of different uses by different groups. In particular, one may actually be talking about inserts. Usually the intended meaning is clear from context, but when it isn’t and it’s important to understand the precise meaning, be sure to ask for clarification.
Guide (obsolete)
(1) A read-sized sequence of the relevant genome supplied from an external data source, e.g. an STS marker, a BAC-end, or a fabricated piece of a known BAC.
Insert
(1) A segment of the target genome placed into a vector and ultimately end-sequenced by us. For example, we are currently planning on sequencing the ends of a 4/1 mix of 2Kbp and 10Kbp inserts.
Layout
(1) A layout is a (partial) positioning of a set of reads with respect to each other, subject to the one constraint that every pair of reads that overlap in the layout do so as defined under "Overlap". The term layout is intended to specifically speak to the arrangement of the reads as opposed to their mutual connectivity (as in "contig") or the sequence(s) the set models (as in "consensus"). A layout includes the orientation of the fragments and, where reads are mate-linked, gives the estimated distance between the contigs that contain each end of a mate pair.
Mate-Pair or Mates
(1) A pair of reads taken from the two ends of a given insert.
Multi Alignment
(1) A multi-alignment of a set of overlapping fragments is a matrix in which a row is a possibly empty prefix of blanks, followed by the sequence of a fragment interspersed with dashes, followed by a possibly empty suffix of blanks. One generally seeks the multi-alignment of the fragments that exposes their similarity and supports the evidence for a particular consensus sequence. Indeed, any computation that produces a consensus either implicitly or explicitly computes a multi-alignment of the underlying reads.
Overlap
(1) A pair of sequences, say A and B, overlap if there is an interval of A and an interval of B that match to within a user-specified level of similarity. If the sequencing error rate is less than 2%, then a match with fewer than 4% differences constitutes an overlap. Typically, one is also implying that the segments involved constitute either a suffix/prefix pair (a "dovetail overlap") or all of one of the two sequences (a "containment overlap"). In pictures:
   A -------------------          or    A --------------------
         ------------------- B                 ---------- B
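The two overlap shapes and the difference threshold can be sketched as follows. This is a minimal illustration and not the assembler's overlapper: it assumes the matching intervals are already known, that both sequences are in the same orientation, and that intervals are half-open `(start, end)` coordinates. The function names `classify_overlap` and `is_overlap` are hypothetical.

```python
def is_overlap(differences, aligned_length, max_diff_rate=0.04):
    """True if the match has fewer than max_diff_rate differences.

    The 4% default corresponds to a sequencing error rate below 2%.
    """
    return aligned_length > 0 and differences / aligned_length < max_diff_rate


def classify_overlap(len_a, a_iv, len_b, b_iv):
    """Classify matching intervals as 'containment', 'dovetail', or 'neither'."""
    a_start, a_end = a_iv
    b_start, b_end = b_iv
    # Containment: the match covers all of one of the two sequences.
    if (a_start == 0 and a_end == len_a) or (b_start == 0 and b_end == len_b):
        return "containment"
    # Dovetail: a suffix of one sequence matches a prefix of the other.
    a_suffix_b_prefix = a_end == len_a and b_start == 0
    b_suffix_a_prefix = b_end == len_b and a_start == 0
    if a_suffix_b_prefix or b_suffix_a_prefix:
        return "dovetail"
    return "neither"


# Left picture: the suffix of A matches the prefix of B.
print(classify_overlap(20, (6, 20), 19, (0, 14)))   # dovetail
# Right picture: all of B matches an interior interval of A.
print(classify_overlap(20, (9, 19), 10, (0, 10)))   # containment
print(is_overlap(1, 40), is_overlap(2, 40))         # True False
```

A match that is interior to both sequences is reported as "neither"; such local matches typically indicate a shared repeat rather than true adjacency.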
Read
(1) A single sequence read produced on an ABI 3700 by our internal production pipeline.
Rocks/Stones
(1) Unitigs that were used to fill a gap in a scaffold. They are usually short and repetitive. Rocks require higher confidence joins than stones. (An even lower confidence category, pebbles, was discontinued after its use in the Celera assembly of Drosophila.) Rocks and stones are "thrown" into gaps late in the scaffold building process. They are thrown in multiple iterations, with the loop count controlled by a run-time parameter.
Scaffold
(1) A maximal set of contigs in a layout that are connected together by mate-links.
(2) A linear ordering of contigs joined by mate pairs. A scaffold defines the order and orientation (DNA strand) for each component contig. There are two ways to measure scaffold length. "Scaffold bases" is the sum of contig lengths. "Scaffold span" is that plus the sum of gap lengths. Celera Assembler uses complex criteria to build scaffolds, but some generalizations apply. Every gap in a scaffold was spanned by at least two mate pairs. A gap with negative length means the sequence data and mate data disagree. Usually, negative gaps are small (20bp) and induced by low-quality sequence at the end of a read. In the FASTA representation of a scaffold, negative gaps are represented by a fixed number (20) of N's.
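The two length measures and the treatment of negative gaps in FASTA output can be illustrated with a short sketch. The function names and the toy contigs are hypothetical. Writing each non-negative gap as a run of N's of the estimated gap length is an assumption made here for illustration; the text above only specifies the fixed run of 20 N's for negative gaps.

```python
NEGATIVE_GAP_NS = 20  # fixed number of N's written for a negative gap


def scaffold_bases(contigs):
    """Sum of contig lengths."""
    return sum(len(c) for c in contigs)


def scaffold_span(contigs, gaps):
    """Scaffold bases plus the sum of gap lengths (gaps may be negative)."""
    return scaffold_bases(contigs) + sum(gaps)


def scaffold_sequence(contigs, gaps):
    """Join contigs with N's; assumes contigs are already ordered and oriented."""
    if len(gaps) != len(contigs) - 1:
        raise ValueError("need exactly one gap between each pair of neighbor contigs")
    parts = [contigs[0]]
    for gap, contig in zip(gaps, contigs[1:]):
        # Assumption: a non-negative gap becomes that many N's.
        n_count = NEGATIVE_GAP_NS if gap < 0 else gap
        parts.append("N" * n_count)
        parts.append(contig)
    return "".join(parts)


contigs = ["ACGTACGTAC", "GGGTTT", "CCAA"]   # lengths 10, 6, 4
gaps = [5, -3]                               # second gap is negative

print(scaffold_bases(contigs))               # 20
print(scaffold_span(contigs, gaps))          # 22
print(len(scaffold_sequence(contigs, gaps))) # 45
```

Note that the FASTA length (45 here) need not equal the scaffold span (22 here) once negative gaps are padded to a fixed 20 N's.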
Singleton
(1) A read that could not assemble. Singletons can represent contamination, unique sequence with no overlap due to the fluctuation of random coverage, or sequence with so many overlaps it could not be assembled efficiently. It can happen that a mate pair has two singletons, and in some contexts these pairs are called mini-scaffolds.
Singleton Unitig
(1) A unitig consisting of a single fragment.
Surrogate
(1) A unitig whose arrival rate statistic was beyond the expected range. Such unitigs are treated as collapsed repeats. Their consensus may get placed in one or more scaffolds. Some of their reads may get placed, by mates, late in the pipeline. When a repetitive unitig cannot be placed even once, it becomes a degenerate.
Unitig (also Chunk)
(1) A high-confidence contig seed. The end of a unitig is, by definition, a place where the overlap data shows multiple, mutually contradictory, paths. Unitigs are supposed to end at repeats.
(2) A uniquely assembleable subset of overlapping fragments. A unitig and/or chunk is an assembly of fragments for which there are no competing choices in terms of internal overlaps. This means that a chunk is either a correctly assembled portion of a contig or it is an overcompressed assembly of several high-fidelity copies of a repeat. Every fragment belongs to one chunk.
UUnitig
(1) A unitig with an arrival rate statistic (based on unitig length and read coverage) within the expected range. The uniqueness designation becomes important during the scaffold building stage. Only a unique unitig can seed a contig. Contigs can be extended by mates and overlaps from their unique unitigs only.