ASM Files


The ASM file is, technically, the sole deliverable of the assembly pipeline. It provides a precise description of the assembly as a hierarchical data structure. It defines all elements of the generated assembly: reads and mate pairs, unitigs, contigs and scaffolds. The fate of (nearly) every read and mate pair input to the assembler (some short and/or low quality reads are discarded on input) is included, though read sequence and quality scores are omitted. Every unitig and contig includes the sequence and quality scores of the consensus and the multiple sequence alignment that produced it.

All other output files (QC Metrics, POSMAP and FASTA Files) are derived from the ASM file.

We are considering discontinuing the ASM file, in favor of smaller and easier to use outputs. Please contact us before developing significant infrastructure to support ASM files.

Messages

The ASM file is a text file composed of “messages” or multi-line records. Each message has a type. The messages are listed in a partial order. The order restriction is known as “definition before reference.” Every identifier must be defined before it is referenced. For example, a mate pair consists of two reads, so both read messages must precede the mate message that joins them. The order of the two reads is not defined, so the requirement is a partial order. The ASM message order restriction is intended to make it easier to write streaming parsers. The easiest way to generate a valid ASM is to group messages by type and exploit the hierarchy of message types. For example, writing all mate messages after all read messages satisfies the order requirements for these message types. Celera Assembler does this currently.

Message and Nested Message Types in ASM Order
{MDI} mate distance estimate for one library
{AFG} augmented fragment with adjusted clear ranges
{AMP} augmented mate pair with assembled status
{UTG {MPS}} one unitig; its read map
{ULK} inter-unitig link
{CCO {MPS}{UPS}{VAR}} one contig; read map; unitig map; consensus variants
{CLK} inter-contig link
{SCF {CTP}} one scaffold; its contig map
{SLK} inter-scaffold link


Messages share a common structure. They start with the open curly bracket ({) and end with the close curly bracket (}). The first line of a message consists of the open-bracket followed by three capital letters followed by a newline. The three-letter acronym defines the message type. The last line of a message consists of just the close-bracket followed by a newline. The internal lines of a message have a format and order dictated by the specific message type.

The standard message layout presents one tag-value pair per line. Each line starts with three lower-case letters that define the tag. The tag is unique within the message. Note the same tag may have a different meaning in some other message type. The tag is followed by a colon and the colon is followed by the value. Single-valued tags have the value on the same line. Multi-valued tags with a constant number of values have the values on the same line separated by commas. Multi-valued tags with a variable number of values have all the values on subsequent lines.

Certain values are multi-line. These are long strings such as the DNA sequence of a contig. These values include newline characters for readability only. Input parsers should strip the newlines from the input stream. These fields are terminated by a period on its own line. Input parsers should strip the terminating period from the value. (The period character is valid data. Message writers must take care to avoid writing a value-internal period on its own line. If the value ends in a period, the writer may write it on its own line, followed by the field terminator. Thus, a message parser should consider a period on its own line to be the field terminator only if it is not followed by a period on its own line. Some existing parsers fail to take this precaution.)
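The period-terminator rule can be sketched as follows. This is a minimal illustration rather than the Celera Assembler parser; the function name `read_multiline_value` and the list-of-lines input are assumptions made for the example.

```python
def read_multiline_value(lines, start):
    """Read a period-terminated multi-line value.

    `lines` is a list of lines with newlines already stripped, and `start`
    is the index of the first data line (the line after the tag).
    Returns (value, index_of_next_unread_line).
    """
    parts = []
    i = start
    while i < len(lines):
        line = lines[i]
        followed_by_period = i + 1 < len(lines) and lines[i + 1] == "."
        # A lone period terminates the field only when the next line is
        # not also a lone period; otherwise it is data ending the value.
        if line == "." and not followed_by_period:
            return "".join(parts), i + 1
        parts.append(line)
        i += 1
    raise ValueError("multi-line field has no terminating period")
```

Joining the data lines with an empty string is what strips the readability newlines from the value.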

A message may contain other messages. The outer message type determines the order and type of its internal messages. At present, only one level of nesting is supported. At present, any list of internal messages must share the same type. Like every message, each internal message must end with the message-terminating curly bracket on its own line.
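The bracket structure described above is enough to split a file into top-level messages in a single streaming pass. The sketch below is an illustration only: the name `iter_messages` is hypothetical, and it assumes that no data line inside a message consists of a lone close bracket or of an open bracket followed by exactly three capital letters.

```python
import io


def iter_messages(stream):
    """Yield (message_type, body_lines) for each top-level message.

    Nested messages are left inside body_lines, brackets included, so a
    type-specific parser can handle them afterwards.
    """
    depth = 0
    msg_type = None
    body = []
    for raw in stream:
        line = raw.rstrip("\n")
        is_open = (len(line) == 4 and line[0] == "{"
                   and line[1:].isalpha() and line[1:].isupper())
        if is_open:
            depth += 1
            if depth == 1:
                msg_type, body = line[1:], []
                continue
        elif line == "}":
            depth -= 1
            if depth == 0:
                yield msg_type, body
                msg_type, body = None, []
                continue
        if depth >= 1:
            body.append(line)


# Hypothetical two-message input, abbreviated for illustration.
sample = "{AFG\nacc:(1,2)\nmst:G\n}\n{UTG\nacc:(3,4)\n{MPS\nmid:1\n}\n}\n"
messages = list(iter_messages(io.StringIO(sample)))
```

Because the generator holds only one message at a time, it fits the streaming style that the definition-before-reference ordering is meant to support.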

Identifiers: UID and IID

A UID is an externally usable identifier. It was originally required to be a 64-bit unsigned integer. That restriction was relaxed so now it can be any string. The Celera Assembler requires a unique UID for every entity, including each read, in the input FRG file. The Celera Assembler assigns a unique UID to every entity, including contigs and scaffolds, in the output ASM file. Celera Assembler can use an external UID generator, or it can generate UIDs starting at 1 on every run. As of Celera Assembler version 5, the UID of an input fragment can be any string, not just a 64-bit unsigned integer.

An IID is an identifier that is internal to one run of the Celera Assembler. An IID is a 32-bit unsigned integer starting at 0. Within the Celera Assembler pipeline, entities are tracked by IID only.

Some messages in the ASM file provide the UID and IID for the same entity. Pairs of (UID,IID) are useful for tracking entities such as unitigs through intermediate stages of the Celera Assembler pipeline.
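A (UID,IID) field can be split as sketched below. The helper name `parse_uid_iid` is hypothetical. Because a UID may be any string, the sketch splits at the last comma and treats only the final component as the integer IID; that choice is an assumption, not documented parser behavior.

```python
def parse_uid_iid(value):
    """Split a '(UID,IID)' field value into (uid_string, iid_integer)."""
    if not (value.startswith("(") and value.endswith(")")):
        raise ValueError("expected a value of the form (UID,IID)")
    # Split at the last comma so that a UID containing commas survives.
    uid, iid_text = value[1:-1].rsplit(",", 1)
    iid = int(iid_text)
    if not 0 <= iid < 2 ** 32:
        raise ValueError("IID does not fit in a 32-bit unsigned integer")
    return uid, iid
```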

Space-Based Coordinate Systems

Celera Assembler uses a “space-based” coordinate system to refer to the bases in a sequence. The space-based coordinates count the spaces before and after bases rather than the bases themselves. Zero always refers to the space before the first base. The sequence “ACGT” has coordinates (0,4) and its subsequence “CG” has coordinates (1,3). The difference between the start and end coordinates gives the sequence length. Misinterpretation of these coordinates can easily lead to “off-by-one” errors.

A C G T
0 1 2 3 4
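Python slice indices happen to follow the same space-based convention, which makes the coordinates easy to check. The helper names below are illustrative only.

```python
def span(sequence, begin, end):
    """Return the bases between two space-based coordinates (begin <= end)."""
    return sequence[begin:end]


def span_length(begin, end):
    """Length of a forward span is the difference of its coordinates."""
    return end - begin


whole = span("ACGT", 0, 4)   # the full sequence
inner = span("ACGT", 1, 3)   # the subsequence between spaces 1 and 3
```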

Consensus Sequence

A unitig’s consensus sequence is calculated from a multiple sequence alignment of the underlying reads. A contig’s consensus sequence is calculated from a multiple sequence alignment of the underlying unitigs. A scaffold’s consensus is not explicitly represented in the ASM file.

Consensus-oriented messages contain these two consensus fields.

MESSAGE FORMAT EXPLANATION
cns:
string+
3-char field name, colon, newline, the consensus sequence. The data begins one line after the “cns” tag. The data consumes up to 70 characters per line. The field terminates with a period on its own line.
qlt:
string+
3-char field name, colon, newline, the consensus quality scores. There is one QV character for each character of the consensus sequence. The field terminates with a period on its own line.

The consensus sequence uses an alphabet of upper-case letters plus a dash. The dash is used to indicate the lack of a consensus base despite a base call in one or more underlying reads. The five characters of the consensus alphabet are: A,C,G,T,-.

A dash indicates a gap indicative of disagreement between the underlying reads. [It seems a dash can have a high quality score, probably based on the surrounding bases. Give the formula.] Dashes are counted in the sequence length.

Quality Values

Every consensus sequence is reported with an associated set of quality scores. There is one quality score associated with each base call in the consensus. The score is encoded as a single character. Thus, a consensus with 100 nucleotides will have a 100-character quality string. The encoding is simple: add 48 to the score and convert it to an ASCII character. The number 48 is the ASCII code of the printable character zero (‘0’). The underlying score is phred-like. The score is (-10)log10(probability of error). Lower-case L (‘l’) is the most common quality reported by Celera Assembler. Example:

FIELD DERIVATION VALUE
Chance of error 1/1,000,000 1*10^-6
Phred score (-10)(-6) 60
Encoding ASCII[60]+'0' = ASCII[108] ‘l’
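The encoding in the table can be reproduced in a few lines. The function names are illustrative and are not part of Celera Assembler.

```python
QV_OFFSET = ord("0")  # 48, the ASCII code of the character zero


def encode_qv(score):
    """Encode an integer quality score as a single character."""
    return chr(score + QV_OFFSET)


def decode_qv(char):
    """Recover the integer quality score from its character."""
    return ord(char) - QV_OFFSET


def error_probability(score):
    """Invert the phred-like formula score = -10 * log10(P(error))."""
    return 10 ** (-score / 10)
```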

Consensus QV Algorithm

The algorithm below is a modification of the standard Churchill-Waterman algorithm [Churchill & Waterman (1992) The accuracy of DNA sequences: estimating sequence quality. Genomics 14(1)].

Initial conditions

Let D = the depth (number of underlying reads) at position i.
Let B[j=1..D] = the column of input base calls in D reads at position i.
Let Q[j=1..D] = the column of input quality values in D reads at position i.
Let C = the output consensus base at position i.
Let QV = the output quality value, an integer between 0 and 60 inclusive, for position i.
Let A[k] = members of the 5-letter alphabet A={A,C,G,T, -}. Dash indicates a gap inserted by aligner.

Assumptions

Assume every input B[j] is an element of A.
Assume every Q[j] is an integer in the range 0-60.

Special case behavior

If Q[j]==0 then Q[j]=5.
If the read depth D= 0 at column i, then C=gap and QV=0. Note this can happen due to surrogate unitigs in repeats.
If the read depth D = 1, then QV = Q[1]. That is, the input base and QV are promoted to the consensus.
C is chosen to maximize QV. In case of a tie, C is chosen randomly from the tied bases.

Formula

Here is intuition. Consider only column i. For read j, we will calculate the amount of support for all k=1..5 possible consensus base calls. We will use the given quality value Q[j]. Since Q indicates the probability of an error, we will subtract from unity to get probability of correctness. The fraction 1/4 is a prior; assuming all 5 base calls are equally likely, the 4 bases not used are each responsible for 1/4 of the residual probability. Note that every read offers some support even for a consensus base that does not match the read.
<math>Pr(B_j == A[k] | Q_j) = \begin{cases} 1-10^{(-Q_j/10)} & B_j == A[k] \\ (\frac{1}{4})10^{(-Q_j/10)} & \mbox{otherwise} \end{cases}</math>
For all 5 possible consensus base calls (k=1..5), take the product of the support from all (j=1..D) reads.
<math>Q_k = \prod_{j=1}^{D} Pr(B_j == A[k] | Q_j) </math>
Choose the base call with the maximum support. Break ties randomly.
<math>Q_{max}=\max_{k=1}^{5}Q_k </math>
Normalize the quality value such that the sum of all probabilities equals unity.
<math>Q_{norm}=Q_{max}(\frac{1}{\sum_{k=1}^{5}Q_k } )</math>
Convert the probability of correctness to probability of error. Convert that to a phred-style quality score.
<math>QV=round((-10)log_{10}(1-Q_{norm}))</math>
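The formulas for a single column can be sketched as below. This is an illustration under stated assumptions, not the Celera Assembler implementation: the name `consensus_call` is hypothetical, the Q=0 substitution is applied before the depth-1 special case, ties are broken by alphabet order rather than randomly so that results are reproducible, and the result is clamped to the 0-60 range given for QV. The plain product can underflow for very deep columns, which the sketch ignores.

```python
import math

ALPHABET = "ACGT-"


def consensus_call(bases, quals):
    """Return (consensus_base, QV) for one alignment column."""
    if len(bases) != len(quals):
        raise ValueError("need one quality value per base call")
    quals = [5 if q == 0 else q for q in quals]   # special case: Q of 0 becomes 5
    depth = len(bases)
    if depth == 0:
        return "-", 0                             # no reads: gap with QV 0
    if depth == 1:
        return bases[0], quals[0]                 # promote the single read
    support = {}
    for candidate in ALPHABET:
        product = 1.0
        for base, q in zip(bases, quals):
            p_err = 10 ** (-q / 10)
            # A matching read supports with 1 - p_err; a mismatching read
            # contributes one quarter of its error probability.
            product *= (1 - p_err) if base == candidate else 0.25 * p_err
        support[candidate] = product
    best = max(ALPHABET, key=lambda c: support[c])  # first maximum wins ties
    q_norm = support[best] / sum(support.values())
    p_error = 1 - q_norm
    if p_error <= 0:
        return best, 60
    qv = round(-10 * math.log10(p_error))
    return best, max(0, min(60, qv))
```

For example, three reads that all call A at Q30 give a raw score well above 60, which the clamp reduces to 60.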

Implicit Messages

Some information is not given explicitly in an ASM file.

  • A singleton is a read that was not incorporated into any scaffold. It is a singleton regardless of whether its mate pair was incorporated. In the ASM file, a singleton is represented as a unitig that (1) contains just the one read, and (2) is not a member of any contig message.
  • A degenerate is a multi-read unitig that was not incorporated into any scaffold. The fact that the degenerate is not itself a scaffold means that the unitig was treated as non-unique, probably because it failed the A-stat test for uniqueness. The fact that the degenerate was not placed into a scaffold means that its sequence and mate pairs were inconsistent with any such placement. In the ASM file, a degenerate is represented as a unitig that (1) contains more than one read, and (2) is not a member of any contig message.

ASM Specification

MDI

The MDI message describes a group of mate pairs that belong to the same library. It is hoped that the library was prepared such that all the mate pairs have approximately the same distance between the 2 mates. Traditionally, libraries are fragments that have been size-selected and end-sequenced together. Celera Assembler assumes that each library has a normal distribution of mate distances, and it calculates a mean and standard deviation per library. The MDI contains statistics about one library as observed after partial assembly. The statistics in the MDI may differ from those in the DST message from the input FRG file and from the mate placements in the final assembly. The MDI stats were derived from contigs at some intermediate stage of the assembly pipeline. The MDI stats were used as targets during the final stage of the assembly pipeline.

MESSAGE FORMAT EXPLANATION
{MDI Modified Distance Message.
ref:(UID,IID) External and internal unique IDs.
mea:float32 Mean of distances.
std:float32 Standard deviation of distances.
min:int32 Minimum distance.
max:int32 Maximum distance.
buc:int32 Number of buckets in histogram. Non-negative integer.
his:
string+
Histogram. One non-negative integer per line.
} End of MDI

ref: Two ID's are given. The first is an externally unique UID; the value is the same string as the DST acc field in the input FRG file. The second is an internal IID; the 32-bit integer value associates this message with IMD messages passed at intermediate stages of the Celera Assembler pipeline. Thus every library has a history of messages {DST, IMD+, MDI} that give the successive estimates of the library statistics.

his: The MDI contains a histogram of distances observed in one library. A normal distribution indicates a well-constructed library. Other distributions can indicate library problems. For instance, a bi-modal distribution suggests mates from two libraries were mistakenly assigned the same library ID. The mate-to-library assignment is taken from the DST message in the input FRG file. The distances reflect empirical observation of the output contigs. Thus, the mean and standard deviation in the MDI message (in *.asm output) may not equal those of the DST message (in *.frg input). Celera Assembler relies on the input values only during initial scaffold construction; later stages use empirical values. The library means are used, for instance, to determine lengths of gaps between contigs in a scaffold. In the histogram, the size of each bucket is size=(max-min+1)/buc. The range of the first bucket is range=[min, min+size). [MORE DETAIL PLEASE. Can users see the histogram of distances observed in the final version of the scaffolds?]

AFG

The AFG message describes one read and its status in the assembly. The AFG message corresponds to an FRG message in the input FRG file (*.frg). The AFG has no sequence or quality fields. This reduces the bulk of the ASM file, but it means users should retain the input FRG file and the output ASM file to have a complete data set. The FRG and AFG clear ranges may differ as they represent the input and output values respectively.

MESSAGE FORMAT EXPLANATION
{AFG Assembled Fragment Message
acc:(UID,IID) Unique ID and accession for internal handling
mst:char Mate status. See below.
chi:int32 Is chimeric? Not used. Always 0.
cha:int32 Is chaff? 1=singleton, 0=otherwise.
clr:int32,int32 Modified clear range. Max=[0,length-1].
} End of AFG

clr: The AFG clear range reflects the region declared trustworthy by the Celera Assembler trimming module. The specific trimming module and its parameters depend on command-line parameters, but the default module is overlap-based trimming (OBT).

cha: A read is chaff if it could not be assembled. Chaff and singleton are synonymous. As of CA 5.2, this flag seems slightly inaccurate; counting the number of reads not placed in unitigs is more difficult but it may be more reliable.

mst: The AFG mate status indicates the relative positioning of the mate pair in the assembly. See the discussion under the AMP message description. Both reads in a pair must have the same mate status.

AMP

The AMP message describes a mated pair of reads. It gives the mate pair’s status in the assembly. The message is a duplication of information, provided for convenience. The pairing of reads is a duplication of information in the LKG link messages of the input FRG file. The mate status field (“mst”) reflects the combined status of the individual reads. The AMP in the output ASM file corresponds to an LKG message in the input FRG file (*.frg).

MESSAGE FORMAT EXPLANATION
{AMP Assembled Mate Pair
frg:UID Unique ID for one read
frg:UID Unique ID for other read
mst:char Mate status.
} End of AMP

frg: The fragment field occurs twice. Each field identifies a distinct read. Each read is identified by UID. [Do unmated reads appear in the AMP? Do reads appear as mates if either or both were trimmed to nothing?]

mst: The mate status field indicates the relative positioning of the mate pair in the assembly. A one-letter code is used. A “good” status means this pair assembled to one scaffold with the proper orientation and within acceptable distance. (Acceptable means within 3 standard deviations of the library mean.) A “bad” status means co-placement on one scaffold but relative placement that is too far, too short, or mis-oriented. Other status codes indicate that one or both reads were output in a singleton, a degenerate unitig, a surrogate unitig, or a distinct scaffold. An “unassigned” status should never appear; it would indicate that the scaffold module had not processed the mate pair. The source code contains the following list of mate status codes for the enumerated type MateStatType. Note that “chaff” is equivalent to singleton, which refers to a single read with no overlap in the assembly.

STATUS NAME
Z UNASSIGNED_MATE
G GOOD_MATE
C BAD_SHORT_MATE
L BAD_LONG_MATE
S SAME_ORIENT_MATE
O OUTTIE_ORIENT_MATE
N NO_MATE
H BOTH_CHAFF_MATE
A CHAFF_MATE
D BOTH_DEGEN_MATE
E DEGEN_MATE
U BOTH_SURR_MATE
R SURR_MATE
F DIFF_SCAFF_MATE
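For scripts that summarize mate outcomes, the table can be carried as a lookup. The dictionary below copies the codes listed above; the helper `tally_mate_status` is a hypothetical convenience, not part of the assembler.

```python
from collections import Counter

MATE_STATUS = {
    "Z": "UNASSIGNED_MATE",
    "G": "GOOD_MATE",
    "C": "BAD_SHORT_MATE",
    "L": "BAD_LONG_MATE",
    "S": "SAME_ORIENT_MATE",
    "O": "OUTTIE_ORIENT_MATE",
    "N": "NO_MATE",
    "H": "BOTH_CHAFF_MATE",
    "A": "CHAFF_MATE",
    "D": "BOTH_DEGEN_MATE",
    "E": "DEGEN_MATE",
    "U": "BOTH_SURR_MATE",
    "R": "SURR_MATE",
    "F": "DIFF_SCAFF_MATE",
}


def tally_mate_status(codes):
    """Count status codes (an iterable of one-letter strings) by name."""
    counts = Counter(codes)
    unknown = set(counts) - set(MATE_STATUS)
    if unknown:
        raise ValueError("unknown mate status codes: %s" % sorted(unknown))
    return {MATE_STATUS[code]: n for code, n in counts.items()}
```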

UTG

The UTG message describes a unitig. A unitig is a high-confidence contig.

Most unitigs are components of contigs. Some unitigs are themselves a contig; these had been regarded as unique sequence although they did not assemble further. Some unitigs will be unplaced “degenerates” or multiply placed “surrogates”; these had been regarded as repetitive sequence and could not seed a contig. Output UTG messages correspond to internal IUM “intermediate unitig messages” generated during the Celera Assembler pipeline. The IUM and UTG messages are linked by a common IID. All UTG messages have a corresponding IUM in the scaffold file (7-CGW/*.cgw). Most UTG messages have a corresponding IUM in the unitig file (4-unitigger/*.cgb) and the post-unitig consensus file (5-consensus/*.cgi); only the post-consensus messages include a consensus sequence. The scaffold module generates some unitigs by splitting; the IUM’s for these unitigs appear in the scaffold module output but not the unitig module output.

MESSAGE FORMAT EXPLANATION
{UTG Unitig Message
acc:(UID,IID) Unitig external and internal ID's.
src:
string+
.
Reserved for programmers. Multiple lines terminated by a period line.
cov:float32 Coverage expressed as arrival rate “A” statistic.
mhp:float32 Measure of polymorphism.
sta:char Unitig status.
abp:int32 Unused. Was branch point. Now 0.
bbp:int32 Unused. Was branch point. Now 0.
len:int32 Number of bases in the consensus sequence.
cns:
string+
.
The consensus sequence.
qlt:
string+
.
The consensus QV scores.
for:int32 Alignment was forced? 0=no, 1=yes.
nfr:int32 Number of reads. Same as number of MPS messages embedded in this message.
{MPS}+ Nested messages mapping reads to the unitig.
} End of UTG.

acc: The accession field contains the external and internal ID’s. The UID is a unique accession across all entities in the ASM file. The IID is unique across unitigs in the ASM file. Unitig IID’s start at 0 and increase by 1. The IID’s in the ASM may not be continuous because not all intermediate unitigs survive to the output stage.

src: The source field will not appear in output from CA version 6 and later. When parsing files from older versions, this field should be treated as a comment. In output from CA version 5, the UTG source included the MHP score in a string like “mhp:1.000000e+00”.

sta: The status field contains a one-letter code. The status field indicates the unitig’s disposition after scaffold construction. See table below. See also the source; values like UnitigStatus = AS_UNIQUE = (int)'U' are assigned by OutputUnitigsFromMultiAligns (in Output_CGW.c).

UNITIG STATUS MEANING OUTCOME
U Unique Placed in scaffold or promoted to its own scaffold
S Repeat, Surrogate Added cautiously at one or more scaffold locations
N Repeat, Degenerate Left out of the assembly as a degenerate
C Chimer No longer used. Now, chimers get broken.
X Unresolved Always changed to S or N during scaffolding.

cov: The coverage field contains the A-stat. It is formulated as a log likelihood ratio. It gives the likelihood this unitig derives from a unique locus of the genome, as opposed to being a collapse of reads from two copies of a genomic repeat. It is calculated from unitig length, read coverage, the total number of reads being assembled, and an estimate of genome size. (The genome size can be set with a command-line parameter. By default, genome size is estimated with a bootstrap procedure that extrapolates from the coverage in a set of large, low-coverage unitigs.) An A-stat of 0.0 indicates no preference. Too-short unitigs have their A-stat set to zero. Negative values indicate repetitiveness. Positive values indicate uniqueness. The Celera Assembler scaffold module uses thresholds of 1.0 for unique and 5.0 for very unique; these are settable parameters.

mhp: This is an experimental measure of polymorphism observed between reads in the unitig. This field of the UTG message was introduced in CA version 6 though it was not yet actually used by the assembler.

nfr: The number-of-fragments field must contain a positive integer. A value of 1 is possible. The unitig module does not generate single-read unitigs. However, the scaffold module can incorporate individual reads during gap-filling operations.

UTG MPS

The “multi-pos” message gives the approximate layout of one read to one unitig. The unitig IUM message contains one MPS for each read in the unitig. In the output ASM file, MPS messages occur within UTG messages. In files from intermediate stages of the Celera Assembler pipeline, the ancestral data appears as internal-mate-pair (“IMP”) messages within internal-unitig (“IUM”) messages.

The unitig message (“UTG”) and the contig message (“CCO”) both contain embedded MPS messages. Both MPS types have the same structure and semantics.

MESSAGE FORMAT EXPLANATION
{MPS Read-to-unitig mapping.
typ:char Read type. Usually ‘R’ for random read. Same as the read type provided in input FRG file’s FRG message.
mid:UID The read identifier.
src:
string+
.
Reserved for programmers. Currently empty.
pos:int32,int32 Start and end positions of the read mapped to the unitig. Zero-based. Relative to the unitig. If (end<start) then the read’s reverse complement maps to the unitig.
dln:int32 Number of integers in the delta encoding.
del:
int32+
Delta encoding of the alignment to consensus sequence. See below.
} End of MPS.

del: The so-called delta encoding is an efficient way to store a pair-wise alignment. The encoding is a series of positive integers, delimited by single space characters, spanning one or more lines. It uses zero-based, space-based coordinates; zero refers to before the 1st base and one refers to between the 1st and 2nd bases. The encoding stores the alignment of one read to the consensus. The encoding assumes the consensus is gapped; the consensus must include sufficient dash characters to leave enough space for every base in every read aligned to it. The encoding assumes the read is ungapped; the delta encoding will provide the gap locations, if any. The encoding assumes trimming; it applies to the sequence that results from application of the AFG clr: field. The encoding applies to the given strand; if start<end in MPS pos:, then it applies to the read sequence, but if start>end, then it applies to the read's reverse complement. The sequence of the read is not present in the ASM file; it should be in the FRG file that was input to the assembler. Simple example: if the trimmed forward-strand read sequence is ACGT, and if the delta is 2 2, then the alignment is AC--GT. Complex example: if the delta encoding is empty, and if the read sequence is AAAGGG, and if the clear range in AFG clr: is (3,6), and if the MPS pos: field is (8003,8000), then the alignment is CCC; in words, the read AAAGGG gets trimmed to GGG, reverse complemented to CCC, and aligned to consensus bases 8001, 8002, 8003 without any gaps in the read.
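Both examples can be reproduced with the sketch below. It assumes that each delta integer is a space-based offset into the trimmed, oriented read at which one dash is inserted, and that the integers are non-decreasing; this interpretation is inferred from the simple example and is not taken from the assembler source. The function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def oriented_clear_read(read, clr, pos):
    """Trim a read to its clear range and orient it as placed by pos."""
    begin, end = clr
    trimmed = read[begin:end]            # space-based, like a Python slice
    start, stop = pos
    if stop < start:                     # reversed placement
        trimmed = trimmed.translate(COMPLEMENT)[::-1]
    return trimmed


def apply_delta(read, delta):
    """Insert one dash at each offset listed in the delta encoding."""
    out = []
    previous = 0
    for offset in delta:
        out.append(read[previous:offset])
        out.append("-")
        previous = offset
    out.append(read[previous:])
    return "".join(out)


simple = apply_delta("ACGT", [2, 2])
complex_case = apply_delta(
    oriented_clear_read("AAAGGG", (3, 6), (8003, 8000)), [])
```

A repeated offset, as in the delta 2 2, yields consecutive dashes at the same position in the read.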

ULK

The unitig link messages indicate connections between unitigs. They summarize the edges in the unitig graph whose nodes are unitigs. The graph’s edges are induced by mate pairs that have one read in each unitig.

MESSAGE FORMAT EXPLANATION
{ULK Start of unitig link message.
ut1:UID Unitig UID.
ut2:UID Unitig UID.
ori:char Orientation.
ovt:char Overlap type. An overlap type of 'N' means that all edges are mate-based, and there will be 'num' 'jls' entries. Other overlap types indicate an edge is overlap-based, and there will be 'num'-1 'jls' entries.
ipc:bool Is possible chimera? 0=no, 1=yes.
gui:bool Includes guide? (Obsolete field, removed in 6.x release.)
mea:float Mean of edge distances. May be negative.
std:float Standard deviation of edge distances.
num:int32 Number of contributing edges.
sta:char Status.
jls:
UID,UID,char +
Jump list. One or more lines. Each line contains 3 fields separated by comma. The 3rd element indicates status.
} End of ULK.

ori: The orientation field indicates the relative orientation of the two unitigs being linked. The values are one of {N=normal=AB_AB, A=anti-normal=BA_BA, O=outie=BA_AB, I=innie=AB_BA}.

sta: The status field indicates the status with a single character. The values are one of {A=in assembly, P=polymorphism, B=bad, C=chimera, U=unknown}. [What is the meaning and consequence of each value?]

ovt: The overlap type field describes the pair-wise alignment of unitigs induced by the collection of mate pairs in this link. It takes on a single character value that is one of {N=no overlap, O=regular overlap, T=tandem overlap, C=containment of 1 by 2, I=containment of 2 by 1}.

num: This field contains the number of edges N in the unitig graph that link these two unitigs. This field is related to L, the number of lines in the jump list. If N=L, then all the edges represent mate pairs with one read in each unitig. If N=L+1, then there is one edge representing a unitig overlap detected by sequence alignment. No other possibilities are valid. Note the relation to the overlap type ovt: ovt = ’N’ implies N=L, whereas ovt != ‘N’ implies N=L+1.
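The relation between num, ovt and the jump list can be checked mechanically. The function names below are hypothetical, and the check covers only the counting rule stated above.

```python
def expected_jump_lines(ovt, num):
    """Number of jls lines implied by the overlap type and edge count."""
    # ovt 'N': every edge is a mate pair, so L = N.
    # Any other ovt: one edge is a sequence overlap, so L = N - 1.
    return num if ovt == "N" else num - 1


def link_is_consistent(ovt, num, jump_lines):
    """True when the jump list has the length that ovt and num imply."""
    return len(jump_lines) == expected_jump_lines(ovt, num)
```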

jls: The jump list is formatted as one or more lines. Each line has 3 fields separated by commas. The first field of each line is a UID for a read that is a member of the “ut1” unitig. The second field of each line is a UID for a read that is a member of the “ut2” unitig. The third field is a single-character status, one of {M=mate, B=BAC, S=STS, Y=may join, T=must join}. In recent versions (after year 2000) of Celera Assembler, the jump list element status is always M. [This paragraph needs verification.]

CCO

The contig message describes one contig. The contig represents a contiguous span of the target genome. The contig contains a layout of unitigs. Some unitigs are special cases, including surrogate unitigs and singleton-read unitigs. The contig has a consensus sequence, possibly with gaps induced by insertions in a minority of the underlying unitig consensus sequences.

MESSAGE FORMAT EXPLANATION
{CCO Begin contig message.
acc:(UID,IID) Accessions. External and internal unique IDs. The parenthesis and comma are included.
pla:char Placement status.
len:int32 Length.
cns:
string+
.
Consensus sequence.
qlt:
string+
.
Consensus QV scores.
for:bool Is forced? (obsolete field)
npc:int32 Number of reads, or 'pieces'. Indicates number of {MPS} records to follow.
nou:int32 Number of unitigs. Indicates number of {UPS} records to follow.
nvr:int32 Number of variants. Indicates number of {VAR} records to follow.
{VAR}+ Nested messages each defining a variant (alternate allele) consensus.
{MPS}+ Nested messages each mapping one read to the contig.
{UPS}+ Nested messages each mapping one unitig to the contig.
} End of CCO.

pla: The placement status field indicates whether the contig was placed in a scaffold. The value is one of {P=placed, U=unplaced}. Placed contigs are members of scaffolds with one or more contigs. Note a placed contig may itself constitute an entire scaffold. Unplaced contigs are also called “degenerate” contigs.

len: The length field indicates the length of the consensus sequence. The length includes any gap characters, which are represented by a dash.

CCO VAR

The variance (“VAR”) messages indicate alternative sequences for small regions of the contig consensus. The varlist format is peculiar to the VAR messages. A varlist is a list of positive integers delimited by single slash characters; it starts on its own line and it is terminated by a period line.

MESSAGE FORMAT EXPLANATION
{VAR Start of variance message.
pos:int32,int32 Start and end coordinates relative to contig consensus.
nrd:int32 Number of reads spanning the variant positions.
nca:int32 Number of predicted variants (including consensus).
anc:int32 Anchor size used to detect variant.
len:int32 Length of the variant. (=end-start?)
vid:IID Accession for this variant.
pid:IID Accession for another variant phased with this one.
nra:
varlist
.
Number of reads contributing to each predicted variant.
wgt:
varlist
.
Weight ratio of variants.
seq:
varlist
.
Sequences in the variants.
rid:
varlist
.
Accessions (IID) of supporting reads.
} End of VAR.

pos: Two non-negative integers are delimited with a comma. The second is greater than the first. They specify the begin and end, in space-based coordinates, of the span on the consensus sequence, which includes gaps. The difference between the two coordinates is reported in the length (“len”) field.

nrd: This gives the read coverage at the variant position. Some of the spanning reads may not have contributed to the consensus or the variant, due to a low-quality base call or lack of confirmation in the other reads.

nca: The number of variants is always 2 or greater. It is equal to the number of integers in each varlist that appears in other fields of the VAR message.

anc: This displays the value of the anchor size parameter. The value was set at the Celera Assembler run time (it is 11 by default).

Example. Below, the algorithm predicted 2 variant sequences for a length=1 region. (A length=1 variation is usually called a SNP.) The major variant, ‘G’, was seen in 3 reads. The minor variant, ‘A’, was seen in 2 reads. The major variant received a higher weight and was reported in the contig consensus at position 131.

{VAR
pos:131,132
nrd:5
nca:2
anc:11
len:1
vid:24
pid:23
nra:
3/2
.
wgt:
109/73
.
seq:
G/A
.
rid:
415362/427751/62439/289752/366692
.
}
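The varlists in this example can be split on the slash character, as in the sketch below. The helper `parse_varlist` is hypothetical. In this example the rid list holds one entry per supporting read (five entries), while the other lists hold one entry per variant.

```python
def parse_varlist(text, convert=int):
    """Split a slash-delimited varlist and convert each item."""
    return [convert(item) for item in text.strip().split("/")]


# Values copied from the example VAR message above.
nra = parse_varlist("3/2")
wgt = parse_varlist("109/73")
seq = parse_varlist("G/A", convert=str)
rid = parse_varlist("415362/427751/62439/289752/366692")

# One (sequence, read count, weight) tuple per predicted variant.
variants = list(zip(seq, nra, wgt))
```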

CCO MPS

Contig messages contain these nested MPS messages. Each MPS maps a single read to the contig. The format of these nested messages mirrors that of the unitig UTG/MPS. See the unitig message description for details.

CCO UPS

Contig messages contain nested UPS messages. Each UPS maps a single unitig to the contig.

MESSAGE FORMAT EXPLANATION
{UPS Begin unitig map to contig.
typ:char Type.
lid:UID The unitig identifier. The confusing field name is spelled L-I-D like the top of a jar.
pos:int32,int32 Mapping coordinates. Two non-negative integers delimited by a comma.
dln:int32 Number of integers in the delta encoding. See description in the UTG MPS message.
del:
int32+
Delta encoding of the alignment to consensus sequence. See description in the UTG MPS message.

} End of UPS.

typ: The type is one of {U=Unique, R=Rock, S=Stone, P=Pebble, s=Single-read}. The pebble type has not been used since the 2000 assembly of Drosophila melanogaster.

pos: Space-based coordinates, relative to the contig gapped consensus, for the begin and end of this alignment. If (end<begin) then the contig used the reverse complement of the unitig's consensus sequence.

CLK

The contig link (“CLK”) messages indicate connections between contigs. Each link message summarizes one or more edges in a graph whose nodes are contigs. The link has a positive multiplicity corresponding to the number of graph edges it represents. All the edges in one link connect the same contig pair. Each edge is formed by one mate pair with one read in each contig.

MESSAGE FORMAT EXPLANATION
{CLK Start of contig link message.
co1:UID Contig UID.
co2:UID Contig UID.
ori:char Orientation.
ovt:char Overlap type. An overlap type of 'N' means that all edges are mate-based, and there will be 'num' 'jls' entries. Other overlap types indicate an edge is overlap-based, and there will be 'num'-1 'jls' entries.
ipc:bool Is possible chimera? 0=no, 1=yes.
gui:bool Includes guide? (Obsolete field, removed in 6.x release.)
mea:float Mean of edge distances. May be negative.
std:float Standard deviation of edge distances.
num:int32 Number of contributing edges.
sta:char Status.
jls:
UID,UID,char +
Jump list. One or more lines starting on the line after the field name. There is no terminating period line. Each line contains 3 elements separated by commas. The 3rd element indicates the status of the edge.
} End of CLK.

The contig link message is the direct analog of the unitig link (“ULK”) message, which links unitigs instead of contigs. See the ULK message for details about the fields.
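Because the jump list has no terminating period line, a parser needs to know in advance how many jls lines to read. The sketch below restates the ovt rule from the field table as a Python function. The name expected_jls_lines is hypothetical, and treating every overlap type other than 'N' the same way is an assumption based on the wording of that rule.

```python
def expected_jls_lines(ovt, num):
    """Number of jump-list lines to read for one CLK message.

    ovt == 'N': all edges are mate-based, so there are num lines.
    Any other overlap type: one edge is overlap-based, so num - 1 lines.
    """
    if num < 1:
        raise ValueError("a link has a positive multiplicity")
    return num if ovt == "N" else num - 1


print(expected_jls_lines("N", 3))
```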

SCF

The scaffold messages define one scaffold per message. A scaffold consists of either one contig, or of multiple contigs and their relative coordinates. In either case, the scaffold is the maximal unit of contiguous sequence output by Celera Assembler.

The scaffold message lists contig pairs, not contigs. A scaffold with three contigs {1,2,3} would be represented by the two pair messages (1,2) and (2,3). A scaffold with only one contig {1} would be represented by the pair message (1,1).

MESSAGE FORMAT EXPLANATION
{SCF Start of scaffold message.
acc:(UID,IID) Scaffold accessions, external and internal. The parenthesis and comma are included.
noc:int32 Number of contig pairs. Indicates the number of nested CTP messages to follow, except that noc=0 is followed by one CTP message.
{CTP}+ One or more contig pairs as nested messages.
} End of scaffold message.

noc: The number of contig pairs is one less than the number of contigs in the scaffold. When noc=0, the scaffold consists of exactly one contig, and the lone CTP message shows that contig linked to itself at zero separation. When noc>0, the scaffold consists of multiple contigs whose relative order, orientation, and separation are derived from mate pairs shared between pairs of contigs.

SCF CTP

The contig pair messages define a pair of contigs that belong to a scaffold. The CTP is a nested message and it always occurs inside a scaffold (SCF) message. The contig pair message defines the relative order, orientation (DNA strand), and separation between two contigs within one scaffold.

MESSAGE FORMAT EXPLANATION
{CTP Start of contig pair message.
ct1:UID External accession of the first contig.
ct2:UID External accession of the second contig.
mea:float Distance between contigs, given as distribution mean.
std:float Standard deviation of the distance distribution.
ori:char Relative orientation of the two contigs.
} End of contig pair message.

ct1: The first contig field gives the ID of one contig in this pair. The ID is a UID, as described elsewhere. The first contig of the first pair message has a special status. As the first contig in a scaffold, its forward strand sequence is incorporated in the scaffold. All subsequent contigs are represented by their forward or reverse strand, as determined by their relative orientation to the first contig.

ct2: The second contig field gives the ID of the other contig in this pair. If ct2=ct1, then the scaffold has only one contig and the enclosing SCF message incorporates only this one CTP message; in this case, the other fields (mean, standard deviation, orientation) contain arbitrary values that should be ignored.

ori: The relative orientation of the contigs is represented by a single letter. ‘N’ for normal indicates forward-forward. ‘A’ for anti-normal indicates reverse-reverse. ‘O’ for outie indicates reverse-forward. ‘I’ for innie represents forward-reverse. Note the first contig of a pair can have a reverse orientation to preserve its orientation in the previous CTP message.

mea: The mean distance gives the predicted number of bases in the gap between the contigs. It is measured from contig end to contig end. A negative distance indicates that the contigs overlap (according to their aggregate mate pairs) though their consensus sequences do not align. In the FASTA representation of a scaffold, negative gap lengths are represented arbitrarily by 20 N's.
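The rules above for noc, ct1, ct2 and ori can be combined to recover the ordered contig list of a scaffold. The Python sketch below is one way to do that. The names ORIENTATION, expected_ctp_count and scaffold_layout are hypothetical, each CTP is represented as a plain dict, and the check that each pair's ct1 repeats the previous pair's ct2 is an assumption drawn from the (1,2), (2,3) example.

```python
# Orientation letter -> (strand of ct1, strand of ct2), as listed under "ori".
ORIENTATION = {
    "N": ("forward", "forward"),
    "A": ("reverse", "reverse"),
    "O": ("reverse", "forward"),
    "I": ("forward", "reverse"),
}


def expected_ctp_count(noc):
    """noc=0 still carries one self-paired CTP; otherwise one CTP per pair."""
    return max(noc, 1)


def scaffold_layout(ctps):
    """Return [(contig, strand, mean_gap_before), ...] in scaffold order."""
    if not ctps:
        raise ValueError("a scaffold has at least one CTP message")
    first = ctps[0]
    if first["ct1"] == first["ct2"]:
        # Single-contig scaffold: mea, std and ori are arbitrary, so ignore them.
        return [(first["ct1"], "forward", None)]
    layout = [(first["ct1"], ORIENTATION[first["ori"]][0], None)]
    for ctp in ctps:
        if ctp["ct1"] != layout[-1][0]:
            raise ValueError("contig pairs do not chain")  # assumed invariant
        layout.append((ctp["ct2"], ORIENTATION[ctp["ori"]][1], ctp["mea"]))
    return layout


# Illustrative values, not taken from a real assembly.
pairs = [
    {"ct1": "1", "ct2": "2", "mea": 50.0, "ori": "N"},
    {"ct1": "2", "ct2": "3", "mea": -12.0, "ori": "I"},
]
for contig, strand, gap in scaffold_layout(pairs):
    print(contig, strand, gap)
```

A negative gap in the result marks a pair of contigs that the mate pairs place in overlap.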

SLK

The scaffold link messages each define a pair of scaffolds and list the mate pairs they have in common. By definition, the mates in the scaffold link were not used to build a scaffold; they may have been too few in number or inconsistent with other mates.

MESSAGE FORMAT EXPLANATION
{SLK Start of scaffold link message.
sc1:UID External accession of the first scaffold.
sc2:UID External accession of the second scaffold.
ori:char Relative orientation of the two scaffolds.
gui:int Includes guide? 0=No, 1=Yes. (Obsolete field.)
mea:float Distance implied by the mates, given as a mean.
std:float Standard deviation of the distance distribution.
num:int Number of mates in the jump list.
jls:
UID,UID,char +
Jump list of mate pairs supporting the link.
} End of scaffold link message.

jls: The jump list is a list of mate pairs, one mate pair per line. The number of lines is given in the preceding field (num). Each line contains two UIDs, one for each read in a single mate pair, followed by a type indicated by a single letter. Possible values are ‘M’ = mate pair or ‘R’ = reread; in practice it is always M.
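A jump list as described here can be read with a few lines of Python. The sketch below splits each line into its two read UIDs and type letter and checks the line count against num. The function name parse_jump_list is hypothetical, and UIDs are kept as strings because nothing here assumes a particular integer width.

```python
def parse_jump_list(lines, num):
    """Parse SLK jump-list lines of the form 'UID,UID,char'."""
    pairs = []
    for line in lines:
        read1, read2, kind = line.strip().split(",")
        if kind not in ("M", "R"):
            raise ValueError("unexpected jump-list type: " + kind)
        pairs.append((read1, read2, kind))
    if len(pairs) != num:
        raise ValueError("jump list length does not match num")
    return pairs


# Read UIDs borrowed from the VAR example for illustration only.
print(parse_jump_list(["415362,427751,M", "62439,289752,M"], 2))
```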

Resources

Source Code

The source is useful as documentation. In fact it is the most precise form of documentation. The source includes one file that defines all the message types. This is a C header file. This file defines the message structures as they would be loaded into computer memory by a C program. The associated text file gives a quick introduction to the message handling mechanics. The associated C file has functions with names like Write_UTG_Mesg, where UTG is a message type.

The source has utilities for writing message parsers. The Celera Assembler naturally has code for reading and writing “message” files such as FRG and ASM. This includes a fast C library. For example, here is a C++ program that relies on that C library:

The distribution includes some Perl that parses messages directly. Perl is slower than C, and Perl parsers are slow on large ASM files. Here is an example.

One can write Perl that binds to the C library for faster parsing. A helpful library was kindly contributed by a Celera Assembler user. Here is sample code.

Utilities

The Celera Assembler is distributed with a utility to extract messages by type from an ASM file. The utility also works on FRG files and all message files. The utility program is built along with the rest of the software.
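For readers without the built utility at hand, the same idea can be sketched in Python. The generator below is not the distributed utility. It is a minimal stand-in that streams a message file and yields every top-level message of the requested types, nested messages included. It assumes that a message opens on a line consisting of '{' plus a three-letter type and closes on a line consisting of '}' alone, as in the VAR example, and that no field content line has either shape. The name extract_messages is hypothetical.

```python
import re

OPEN = re.compile(r"^\{([A-Z]{3})$")


def extract_messages(lines, wanted):
    """Yield each top-level message whose type is in `wanted`, as one string."""
    depth = 0
    keep = False
    current = []
    for raw in lines:
        line = raw.rstrip("\n")
        match = OPEN.match(line)
        if match:
            if depth == 0:
                # A new top-level message starts here.
                keep = match.group(1) in wanted
                current = []
            depth += 1
        if keep:
            current.append(line)
        if line == "}" and depth > 0:
            depth -= 1
            if depth == 0 and keep:
                yield "\n".join(current)
                keep = False


# Tiny illustrative input built from field names documented above.
sample = [
    "{CLK", "co1:1", "co2:2", "}",
    "{SCF", "acc:(7,1)", "noc:0",
    "{CTP", "ct1:7", "ct2:7", "mea:0.0", "std:0.0", "ori:N", "}",
    "}",
]
for message in extract_messages(sample, {"SCF"}):
    print(message)
```

Because the function takes any iterable of lines, an open file handle can be passed directly, so a large ASM file is never loaded whole.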

Converters

It is possible to convert ASM files to other file formats.

  • AMOS is a 3rd party suite. It includes a tool (cavalidate) to convert ASM to the AMOS format called BANK, and from BANK to other formats such as ACE.
  • Another 3rd party tool is asm2ace.
  • The Celera Assembler is distributed with a utility, ca2ace.pl, for converting the CA ASM output file to an ACE database. This utility works on trivial examples, but it may not work on complex data. In particular, it does not resolve the issue of 0X regions on CA assemblies. Most of these are regions with a "placed surrogate" consensus, where a repeat unitig could be placed even though its individual reads could not be placed. Consed cannot load an ACE with 0X regions. (Another class of 0X regions is very small ones at contig ends. These reflect defects in Celera Assembler that are hard to fix. They are products of sequence shifts during the final multiple sequence alignment.)

Older Documentation

An excellent document was written at Celera while it was generating the draft human assembly in 2000. That document is no longer up to date, but it was thorough and accurate, and it is still a good reference. (The "human" version is a revision of the other.)

  • Document: src/AS_DOC/BigPicture/IOSpecDoc_Human.rtf
  • Document: src/AS_DOC/BigPicture/IOSpecDoc.rtf