POSMAP

POSMAP (POSitional MAPping) files are part of the Celera Assembler output. The files are named *.posmap.* and are written to the 9-terminator directory.

POSMAP files are derivative files, generated by parsing the primary output file (the ASM file). Compared to the ASM file, the POSMAP files are user-friendly: they are easy to work with using standard Unix tools like grep, awk, and perl.

POSMAP files are line-oriented text files. Each line is a self-contained record that includes two or more data fields separated by white space. Each line typically describes the relationship between a small object and a large object. For example, there are POSMAP files for the read-to-contig relationship. The filename encodes which relationship is described. For example, "myGenome.posmap.frgctg" describes the relationship between reads (or fragments) and contigs.

Some POSMAP files have special formats, as noted later. Most POSMAP files have the following line format.

small_ID big_ID begin end orientation

The begin and end coordinates define an interval on the big_ID object. POSMAP coordinates are space-based, referring to the spaces around the bases, starting with zero. For example, the first ten letters in any sequence are specified by begin=0, end=10 (not 1 and 10). The coordinates always refer to sequence after the alignment-induced gaps were squeezed out. Thus, POSMAP coordinates correspond to the Celera Assembler FASTA files. POSMAP coordinates do not correspond to the sequences in the ASM file, as those contain gaps. (In this discussion, "gap" refers to a short span represented by one or more dashes. The gap indicates where a minority of spanning reads contain an inserted base. This is in contrast to the long gaps, represented by N's, between contigs in a scaffold.)
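Because space-based coordinates count the positions between bases, they line up with zero-based, half-open slices in many programming languages. The following Python sketch illustrates this for the common five-field line format. The record values, the orientation letter "f", and the sequence are made up for illustration, and the helper names are hypothetical.

```python
from typing import NamedTuple


class PosmapRecord(NamedTuple):
    small_id: str
    big_id: str
    begin: int
    end: int
    orientation: str


def parse_posmap_line(line: str) -> PosmapRecord:
    """Parse one line in the common 'small_ID big_ID begin end orientation' format."""
    small_id, big_id, begin, end, orientation = line.split()[:5]
    return PosmapRecord(small_id, big_id, int(begin), int(end), orientation)


def covered_bases(record: PosmapRecord, big_sequence: str) -> str:
    """Return the bases of the big object spanned by the record.

    big_sequence is assumed to be the ungapped sequence of record.big_id,
    for example as read from the assembler's FASTA output.
    """
    return big_sequence[record.begin:record.end]


# Illustrative values only: begin=0, end=10 selects the first ten bases.
rec = parse_posmap_line("read1 ctg7 0 10 f")
print(covered_bases(rec, "ACGTACGTACGTTT"))  # ACGTACGTAC
```

The interval length is simply end minus begin, with no off-by-one adjustment.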

posmap.frags

List of reads.

readID clearRangeStart clearRangeEnd statusInAssembly distanceToMate
clearRange
The trusted sequence range. If run parameters allowed Celera Assembler to trim the reads, then the output clear range may differ from the input clear range. Coordinates are base-based starting at zero.
statusInAssembly
Values are placed or chaff. Tells if the fragment was assembled (placed in a scaffold) or left unused (chaff). Chaff fragments are reported in the singleton.fasta output file.
distanceToMate
An integer, in bases, or the string notMated.
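As a rough sketch of how this file might be summarized, the Python below counts reads by statusInAssembly and counts how many reads report a mate distance. The sample lines are invented, and the function name is hypothetical.

```python
from collections import Counter


def summarize_frags(lines):
    """Count reads per statusInAssembly and count reads that have a mate distance."""
    status_counts = Counter()
    mated = 0
    for line in lines:
        fields = line.split()
        if len(fields) != 5:
            continue  # skip blank or unexpected lines
        read_id, clear_start, clear_end, status, distance = fields
        status_counts[status] += 1
        if distance != "notMated":
            mated += 1
    return status_counts, mated


# Invented sample records in the posmap.frags layout.
sample = [
    "read1 0 850 placed 3021",
    "read2 12 790 placed 3021",
    "read3 0 400 chaff notMated",
]
counts, mated = summarize_frags(sample)
print(dict(counts), mated)
```

With a real file, the same function can be given an open file handle instead of a list.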

posmap.mates

List of mated reads (aka, paired ends, mate pairs).

firstReadID secondReadID mateStatus

The mateStatus field describes how well the mate edge agrees with the assembly.

good
both in same scaffold at proper distance & orientation
badLong
both in same scaffold but distance is stretched
badShort
both in same scaffold but distance is crunched
badOuttie
both in same scaffold but orientation is wrong (Outtie: <-- --> instead of expected Innie: --> <--)
badSame
both in same scaffold but orientation is wrong (Same: --> --> instead of expected Innie: --> <--)
diffScaffold
both are in different scaffolds
bothChaff
both are singletons
oneChaff
one not in chaff, other is singleton. Unplaced contigs (degenerates) are counted as non-chaff for this statistic so a mate pair with one singleton fragment and one fragment in a degenerate will have oneChaff status.
bothSurrogate
both are in surrogates (repeat unitigs)
oneSurrogate
one in scaffold, one unplaced in surrogate. CGW will try to place surrogate reads in exactly one surrogate instance if it can satisfy the mate pair. When this happens, the mate pair will have a status of good, not oneSurrogate.
bothDegen
both are in degenerates
oneDegen
one in scaffold, one in degenerate. This status takes precedence over oneSurrogate. For example, if a mate pair has one fragment in a surrogate and one fragment in a degenerate, the mate status will be oneDegen, not oneSurrogate.
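One possible use of the mateStatus field is to pull out the pairs that sit in the same scaffold but disagree with the assembly, since these may point at misassemblies. The Python sketch below does this; the grouping of the four "bad" statuses into one set is a choice made for this example, and the sample read IDs are invented.

```python
# Statuses where both reads are in the same scaffold but the pair is violated.
SAME_SCAFFOLD_BAD = {"badLong", "badShort", "badOuttie", "badSame"}


def violated_mates(lines):
    """Yield (firstReadID, secondReadID, mateStatus) for violated same-scaffold pairs."""
    for line in lines:
        fields = line.split()
        if len(fields) == 3 and fields[2] in SAME_SCAFFOLD_BAD:
            yield tuple(fields)


# Invented sample records in the posmap.mates layout.
sample = [
    "r1 r2 good",
    "r3 r4 badLong",
    "r5 r6 diffScaffold",
    "r7 r8 badOuttie",
]
for first, second, status in violated_mates(sample):
    print(first, second, status)
```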

posmap.utglen

Unitig length.

unitigID unitigLength

posmap.utglkg

Unused evidence that two unitigs should be joined in a scaffold.

unitigID unitigID orientation isOverlap isChimeric meanDistance variance numLinks status links ...
orientation
'I' = first unitig is forward, second is reverse. 'O' = first unitig is reverse, second is forward. 'N' = both unitigs forward. 'A' = both unitigs reverse.
isOverlap
either 'N' for mated fragments, or 'O' for overlapping fragments.
isChimeric
either 'chimeric' for a possible chimeric link, or '.'.
status
'A' = in assembly, a trusted edge. 'B' = bad, an untrusted edge. 'U' = not known to be good or bad.
link
A triplet of fragID,fragID,type, where type is 'M' for mate-pair and 'O' for overlap.
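The link files have a fixed set of leading fields followed by a variable number of link triplets. The Python sketch below parses one such line under two assumptions that the description above suggests but does not state outright: each triplet is a single whitespace-delimited token with comma-separated parts, and meanDistance and variance are numeric. The sample line and the function name are invented.

```python
def parse_link_line(line):
    """Parse one posmap.utglkg-style line into a dictionary."""
    f = line.split()
    return {
        "first_id": f[0],
        "second_id": f[1],
        "orientation": f[2],                # 'I', 'O', 'N', or 'A'
        "overlap_type": f[3],               # 'N' (mated) or 'O' (overlapping)
        "is_chimeric": f[4] == "chimeric",  # otherwise the field is '.'
        "mean_distance": float(f[5]),
        "variance": float(f[6]),
        "num_links": int(f[7]),
        "status": f[8],                     # 'A', 'B', or 'U'
        # Assumption: each link is one token of the form fragID,fragID,type.
        "links": [tuple(token.split(",")) for token in f[9:]],
    }


# Invented sample record.
rec = parse_link_line("utg1 utg2 I N . 1500.0 225.0 2 U f1,f2,M f3,f4,M")
print(rec["num_links"], rec["links"])
```

Comparing num_links with the number of parsed triplets is a cheap sanity check on the assumed layout.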

posmap.deglen

Degenerate contig length.

degenerateContigID degenerateContigLength

posmap.ctglen

Contig length. (Contigs in scaffolds only.)

contigID contigLength

posmap.ctglkg

Unused evidence that two contigs should be joined in a scaffold.

contigID contigID orientation isOverlap isChimeric meanDistance variance numLinks status links ...
orientation
'I' = first contig is forward, second is reverse. 'O' = first contig is reverse, second is forward. 'N' = both contigs forward. 'A' = both contigs reverse.
isOverlap
either 'N' for mated fragments, or 'O' for overlapping fragments.
isChimeric
either 'chimeric' for a possible chimeric link, or '.'.
status
'A' = in assembly, a trusted edge. 'B' = bad, an untrusted edge. 'U' = not known to be good or bad.
link
A triplet of fragID,fragID,type, where type is 'M' for mate-pair and 'O' for overlap.

posmap.scflen

Scaffold span. The span includes any N's between contigs, so it is not the same as the number of bases in a scaffold.

scaffoldID scaffoldLength

posmap.scflkg

Unused evidence that two scaffolds should be joined in a scaffold.

scaffoldID scaffoldID orientation meanDistance variance numLinks links ...
orientation
'I' = first scaffold is forward, second is reverse. 'O' = first scaffold is reverse, second is forward. 'N' = both scaffolds forward. 'A' = both scaffolds reverse.
link
A triplet of fragID,fragID,type, where type is 'M' for mate-pair and 'O' for overlap.

posmap.frgutg

Mappings of fragments to unitigs.

fragmentID unitigID begin end orientation

posmap.frgdeg

Mappings of fragments to degenerate contigs.

fragmentID degenerateID begin end orientation

posmap.frgctg

Mappings of fragments to contigs.

fragmentID contigID begin end orientation

posmap.frgscf

Mappings of fragments to scaffolds.

fragmentID scaffoldID begin end orientation

posmap.utgdeg

Mappings of unitigs to degenerates.

unitigID degenerateID begin end orientation

posmap.utgctg

Mappings of unitigs to contigs, but only for contigs in scaffolds. In particular, degenerate contigs are not listed here.

unitigID contigID begin end orientation

posmap.utgscf

Mappings of unitigs to scaffolds.

unitigID scaffoldID begin end orientation

posmap.ctgscf

Mappings of contigs to scaffolds.

contigID scaffoldID begin end orientation
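Since scaffold coordinates include the N's between contigs, the contig placements can be used to estimate the gap between neighboring contigs. The Python sketch below groups contigs by scaffold, sorts them by begin, and subtracts each contig's end from the next contig's begin. It assumes begin is never greater than end regardless of orientation; the sample lines, the orientation letter "f", and the function name are invented.

```python
from collections import defaultdict


def scaffold_gaps(ctgscf_lines):
    """Return {scaffoldID: [gap sizes between consecutive contigs]}."""
    placements = defaultdict(list)
    for line in ctgscf_lines:
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or unexpected lines
        contig_id, scaffold_id, begin, end = fields[:4]
        placements[scaffold_id].append((int(begin), int(end), contig_id))

    gaps = {}
    for scaffold_id, contigs in placements.items():
        contigs.sort()  # order contigs along the scaffold by begin coordinate
        gaps[scaffold_id] = [
            nxt[0] - cur[1] for cur, nxt in zip(contigs, contigs[1:])
        ]
    return gaps


# Invented sample records in the posmap.ctgscf layout (deliberately unsorted).
sample = [
    "ctg2 scf1 1020 3000 f",
    "ctg1 scf1 0 1000 f",
    "ctg3 scf2 0 500 f",
]
print(scaffold_gaps(sample))  # {'scf1': [20], 'scf2': []}
```

A scaffold with a single contig has no gaps, so its list is empty.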

posmap.vardeg

Consensus variant positions in degenerates. See posmap.varscf below.

posmap.varctg

Consensus variant positions in contigs. See posmap.varscf below.

posmap.varscf

Consensus variant positions in scaffolds. Each line points to a small region for which two or more reads support an alternate sequence that is not represented in the FASTA files.

variantList scaffoldID begin end numReads numAlleles anchorSize length numReadsList weightList fragmentsList

Several of these fields are lists of variants. Each list uses the same ordering; the Nth item in each list is for the Nth variant.

variantList
List of variant consensus sequences.
scaffoldID
The scaffold this variant occurs in.
begin
The start position of the variant (space-based).
end
The end position of the variant (space-based).
numReads
The number of reads that span this variant.
numAlleles
The number of variants. Each list will have exactly this number of items.
anchorSize
An algorithm parameter; variants separated by this many or fewer bases might be phased together.
length
The number of bases in the variant.
numReadsList
For each variant, the number of reads that support it.
weightList
For each variant, the strength of the variant.
fragmentsList
The list of reads that span this variant. The reads are listed in order; for a variant with numReadsList = "1/2", and fragmentsList = "A/B/C", fragment A supports variant 1 and fragments B and C support variant 2.
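The ordering rule for fragmentsList can be sketched in Python as follows. The code walks numReadsList and hands out that many fragments to each variant in turn. It assumes that variantList uses the same "/" separator shown for numReadsList and fragmentsList, which the description above does not confirm. The sample line, its values, and the function name are invented.

```python
def reads_by_variant(line):
    """Return a list of (variant, [supporting reads]) for one posmap.varscf line."""
    f = line.split()
    variants = f[0].split("/")  # assumption: '/'-separated like the other lists
    reads_per_variant = [int(n) for n in f[8].split("/")]  # numReadsList
    fragments = f[10].split("/")                           # fragmentsList

    grouped = []
    start = 0
    for variant, count in zip(variants, reads_per_variant):
        # The next `count` fragments support this variant.
        grouped.append((variant, fragments[start:start + count]))
        start += count
    return grouped


# Invented record: numReadsList = "1/2", fragmentsList = "frgA/frgB/frgC".
sample = "A/G scf1 100 101 3 2 11 1 1/2 5/9 frgA/frgB/frgC"
print(reads_by_variant(sample))
# [('A', ['frgA']), ('G', ['frgB', 'frgC'])]
```

This mirrors the example in the text: the first fragment supports variant 1 and the remaining two support variant 2.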