FRG Files

From wgs-assembler

Celera Assembler takes its inputs from "FRG" (frag) files. You can have as few (1) or as many (millions) as you want.

These FRG files consist of sequencer reads and relationships between the reads. Two types of relationships are defined: libraries and mates. A library indicates that all reads in this collection come from the same insert library and thus share numerous properties: end orientation, clone size, randomness, approximate read size, etc. A mate indicates that exactly two reads are from opposite ends of a single clone in a library.

Generating FRG files

The Celera Assembler expects input fragment data to be in the FRG format. We provide several utilities for converting a variety of data types into this format:

  • convert-fasta-to-v2.pl - converts sequence and quality values in fasta format.
  • tracedb-to-frg.pl - converts xml, qual and fasta from the NCBI TraceDB into FRG format.
  • sffToCA - converts 454 SFF files into FRG format, optionally searching each read for 'linker' sequence indicating the read is a pair of mated reads.
  • fastqToCA - generates a FRG file that allows direct loading of Illumina FastQ files.

Common record fields

Several records use the same record fields with the same meaning. These are described here, instead of in each record.

  • The action (act) is one of add, update, ignore or delete (A, U, I, D, respectively). The action tells the gatekeeper module what to do with the message.
  • The comment (com) and source (src) fields store a free-format comment string.
  • Lines beginning with the comment character '#' are ignored. Lines inside multi-line record fields (e.g., com:, src:, seq:, qlt:) are never comment lines.
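
The comment rule above can be sketched in Python. This is only an illustration, not the gatekeeper's actual logic: the function name strip_comment_lines and the MULTILINE_TAGS set are hypothetical, and the sketch assumes a multi-line field starts with a bare tag line such as "src:" and ends with a line holding a single period.

```python
# Multi-line fields named in the text above. Assumption: other multi-line
# tags (for example hps:) could be added to this set if needed.
MULTILINE_TAGS = {"com:", "src:", "seq:", "qlt:"}


def strip_comment_lines(lines):
    """Drop '#' comment lines, but keep every line inside a multi-line field."""
    kept = []
    in_field = False
    for line in lines:
        text = line.rstrip()
        if in_field:
            # Inside a multi-line field nothing is a comment.
            kept.append(line)
            if text == ".":
                in_field = False
        elif text.startswith("#"):
            # A comment line outside any multi-line field is ignored.
            continue
        else:
            kept.append(line)
            if text in MULTILINE_TAGS:
                in_field = True
    return kept
```

For example, a '#' line between "src:" and the closing period is kept, while the same line between two messages is dropped.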

Batch Message BAT

This message is no longer used and is silently ignored.

The optional batch record can contain comments on the source of the file.

{BAT
bna:<name>
acc:<batchUID>
com:
<comment>
.
} 

Version Message VER

The version record tells the assembler what version of the input specification should be used to read messages from this point forward. Any number of VER records may be present in a single file; the last version record encountered is in effect. Files without a VER record are assumed to comply with version 1 of the specification.

The writer will (usually) write to the latest specification version. There are numerous exceptions.

{VER
ver:<version-number>
} 
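
The rule that the last VER record encountered is in effect, with version 1 as the default, can be sketched as follows. The function name version_in_effect is hypothetical, and the sketch assumes one record field per line as in the template above.

```python
def version_in_effect(lines, default=1):
    """Return the specification version in effect after reading all lines."""
    version = default          # files without a VER record use version 1
    in_ver = False
    for line in lines:
        text = line.strip()
        if text == "{VER":
            in_ver = True
        elif text == "}":
            in_ver = False
        elif in_ver and text.startswith("ver:"):
            # A later VER record replaces any earlier one.
            version = int(text[len("ver:"):])
    return version
```

A real reader would update the version while streaming messages, so that each message is interpreted under the version in effect at that point in the file.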

Library Message LIB

The library record describes properties about an insert library. Each library has an insert size, specified as a mean (mea) and standard deviation (std).

The orientation field (ori) describes the orientation implied when two fragments are linked. It is one of innie, outtie, normal or unoriented: I=Innie if the 3' ends of mated reads are closest, O=Outtie if the 5' ends of mated reads are closest, N=Normal if the 3' end of one read is closest to the 5' end of the other read, or U=Unoriented if the library does not support mated reads. Link messages for reads in orientation=U libraries lead to run-time errors. For orientation=N libraries, it is neither necessary nor possible to specify which read has the 3' end closer to the other's 5' end.

{LIB
act:<action>
acc:<UID of library>
ori:<link orientation>
mea:<mean>
std:<standard deviation>
src:
<free format text, description of the source of this data>
. 
nft:<number of optional features>
fea:
<list of optional features>
.
} 
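
As an illustration of the template above, here is a minimal Python sketch that writes a LIB message. The function name format_lib and its parameters are hypothetical. The sketch derives nft from the feature list so the two cannot disagree, and it assumes the file is read as version 2 or higher, since the nft and fea fields are part of the template.

```python
def format_lib(acc, ori, mean, std, features=None, src="", act="A"):
    """Return the text of one LIB message."""
    if ori not in ("I", "O", "N", "U"):
        raise ValueError(f"unknown orientation {ori!r}")
    features = features or {}
    lines = [
        "{LIB",
        f"act:{act}",
        f"acc:{acc}",
        f"ori:{ori}",
        f"mea:{mean}",
        f"std:{std}",
        "src:",
    ]
    if src:
        lines.append(src)
    lines.append(".")
    # nft is computed from the features actually written.
    lines.append(f"nft:{len(features)}")
    lines.append("fea:")
    lines.extend(f"{tag}={value}" for tag, value in features.items())
    lines.extend([".", "}"])
    return "\n".join(lines) + "\n"
```

Calling format_lib("lib1", "I", 5821, 1513, {"isNotRandom": 1}) produces a message with nft:1 and a single isNotRandom=1 feature line.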

Library Features

Features address experimental functionality or technology-specific read processing. Some features are specific to one version of Celera Assembler. Features are specified as a list of tag=value pairs within the LIB message.

Feature lists are supported in the FRG file format version 2 and higher. The list is prefaced by the number-of-features tag (nft). The list begins with the feature tag (fea), is followed by one tag=value pair per line, and is ended by a period on its own line. Behavior is undefined when the number of features has an incorrect value. A feature line may include white space immediately before and after the equal sign (=). A feature list may span comment lines that begin with the #-sign.

Example:

nft:17
fea:
forceBOGunitigger=0
isNotRandom=0
doNotTrustHomopolymerRuns=0
doTrim_initialNone=1
doTrim_initialMerBased=0
doTrim_initialFlowBased=0
doTrim_initialQualityBased=0
doRemoveDuplicateReads=0
doTrim_finalLargestCovered=1
doTrim_finalEvidenceBased=0
doTrim_finalBestEdge=0
doRemoveSpurReads=1
doRemoveChimericReads=1
doCheckForSubReads=1
doConsensusCorrection=0
forceShortReadFormat=0
constantInsertSize=0
.
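
A feature list like the one above can be read with a short Python sketch. The function name parse_features is hypothetical. Because behavior is undefined when nft is wrong, this sketch chooses to reject a mismatch; that is a design choice of the example, not documented assembler behavior.

```python
def parse_features(lines):
    """Parse the nft: and fea: lines of a LIB message into a dict."""
    it = iter(lines)
    declared = None
    for line in it:
        text = line.strip()
        if text.startswith("nft:"):
            declared = int(text[len("nft:"):])
        elif text == "fea:":
            break
    else:
        raise ValueError("no fea: line found")

    features = {}
    for line in it:
        text = line.strip()
        if text == ".":
            break                      # a lone period ends the list
        if text.startswith("#"):
            continue                   # comment lines may appear in the list
        tag, sep, value = text.partition("=")
        if not sep:
            raise ValueError(f"not a tag=value pair: {line!r}")
        # White space around '=' is allowed, so strip both sides.
        features[tag.strip()] = value.strip()

    if declared is not None and declared != len(features):
        raise ValueError(f"nft says {declared}, found {len(features)}")
    return features
```

Values are returned as strings; converting them to integers or booleans is left to the caller.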

forceBOGunitigger
DEPRECATED - use the runCA unitigger option.
If set, use the 'bog' unitigger instead of the default 'utg' unitigger.

doNotTrustHomopolymerRuns
DEPRECATED - 454 reads have improved in quality so that the expense of the 'mer' overlapper is not worth the slight increase in overlap quality.
In the 'mer' overlapper, treat unaligned bases within 1-base (homopolymer) repeats as matches.

doTrim_initialNone
During Overlap Based Trimming, do no initial QV based trimming.

doTrim_initialMerBased
During Overlap Based Trimming, use kmer evidence for an initial trimming.

doTrim_initialFlowBased
DEPRECATED - equivalent to doTrim_initialNone.

doTrim_initialQualityBased
During Overlap Based Trimming, use a windowed QV average to initially trim the read.

doRemoveDuplicateReads
Remove inexact duplicate reads and apparent duplicate mates.
Strongly recommended for 454 reads.
Of slight value for Illumina libraries.
Of no value for Sanger and PacBio libraries.

doTrim_finalLargestCovered
After overlaps are computed, trim the read to the largest region covered by overlaps.
Suggested for all but Sanger reads.

doTrim_finalEvidenceBased
After overlaps are computed, trim the read based on Sanger sequencing specific heuristics.

doTrim_finalBestEdge
EXPERIMENTAL algorithm to trim the read so that the thickest overlap possible is generated. This algorithm is known to have flaws.

doRemoveSpurReads
During Overlap Based Trimming, delete or fix spur reads.

doRemoveChimericReads
During Overlap Based Trimming, delete or fix suspected chimeric reads.
Recommended for Sanger, 454 and PacBio libraries.

doCheckForSubReads
Detect a specific overlap pattern indicating the presence of PacBio adapter-linked subreads.
Recommended for uncorrected PacBio libraries.

doConsensusCorrection
Used during pacBioToCA. No user-serviceable parts. For internal use only.
Polish error-prone reads (e.g. PacBio reads) using other reads (e.g. Illumina reads).

forceShortReadFormat
Store short reads in a slightly more efficient data structure in the gkpStore. Reads longer than 160bp will be truncated and an error generated.

isNotRandom
The reads are ignored by the coverage assessment, so pile-ups of these reads will not contribute to the perceived repetitiveness of their unitigs. This should be set for library constructions that are not random samples of the genome.

constantInsertSize
Do not allow the library insert size to be re-estimated during the assembly.

Four additional features are generated by fastqToCA to specify the location and type of reads in FASTQ files. Please use fastqToCA to generate these.

fastqQualityValues
fastqOrientation
fastqReads
fastqMates

Fragment Message FRG

The fragment record is the primary input record type. It contains the sequence, quality values and ancillary data for each fragment to be used in the assembly.

The random fragment flag (rnd) indicates whether this read is truly a random read. Its value is 0 if the read is not random, and 1 otherwise. All libraries are assumed to generate random fragments (unless a library feature says otherwise); this flag allows specific reads to be isolated as non-random.

The status code (sta) describes the overall health of the read, e.g., good, E. coli contamination, BAC vector, etc. The statuses are:

  • good read
  • short clear range, some evidence of high signal
  • 3730xl sentinel base call of NNNNN, basically nothing there
  • low signal, short clear range, and/or trace tuner failure
  • some signals high, others low, and short clear range
  • BAC vector screening trash
  • E. coli contamination screening trash
  • Insert is only poly A/T, or insert is fewer base pairs than the project threshold (usually 100 base pairs)
  • Rearrangement of vector. The clear range of the read following vector trimming includes evidence of vector sequence

The library (lib) is the UID of the insert library this read comes from. The plate (pla) and plate location (loc) provide some level of tracking where this read was sequenced.

The sequence, quality values and homo-polymer run/peak spacing are supplied as multi-line character strings. No limit on line length is assumed. Quality values are encoded in ASCII by adding 48 (ASCII zero) to the quality value. Homo-polymer run/peak spacing data is optional.
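
The quality encoding described above (quality value plus 48) can be sketched in Python. The function names encode_quality and decode_quality are hypothetical. The decoder drops white space first, so a multi-line qlt field can be passed in as one string.

```python
QV_OFFSET = 48  # ASCII '0'


def encode_quality(values):
    """Encode a list of integer quality values as an FRG quality string."""
    return "".join(chr(q + QV_OFFSET) for q in values)


def decode_quality(text):
    """Decode an FRG quality string (possibly multi-line) into integers."""
    compact = "".join(text.split())   # remove line breaks between rows
    return [ord(c) - QV_OFFSET for c in compact]
```

For instance, the character '5' decodes to quality 5 and 'g' decodes to quality 55.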

NOT UP TO DATE; clear ranges change in v6.x

There are three clear ranges associated with each fragment: the vector, quality and final clear ranges. The vector clear range (clv) indicates the portion of the read which is free from known sequencing vector. The quality clear range (clq) indicates the portion of the read which is of high sequencing quality. The final clear range (clr) is a combination of the vector and quality ranges, usually the intersection. Several special cases exist for clear ranges:

  1. To indicate that either the vector or quality clear range is not known, omit the clv: or clq: entry.
  2. To indicate no sequence is in a clear range, any begin >= end may be used, with 0,0 being the standard. For example, "clr:0,0".

{FRG
act:A
acc:<read UID>
rnd:<random read flag>
sta:<status code>
lib:<library UID>
pla:<plate UID>
loc:<plate location code>
src:
<free format text, description of the source of this data>
.
seq:
<sequence>
.
qlt:
<quality values>
.
hps:
<homo-polymer run or peak spacing or etc info>
.
clv:<beg,end>
clq:<beg,end>
clr:<beg,end>
} 
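
The clear range rules described before the template can be sketched as an intersection in Python. This follows the description on this page, which is marked as not up to date for v6.x, so treat it as an illustration only. The function names are hypothetical; None stands for an omitted clv: or clq: entry, and any range with begin >= end is reported as the standard empty range 0,0.

```python
def normalize(rng):
    """Map any empty range (begin >= end) to the standard (0, 0)."""
    beg, end = rng
    return (0, 0) if beg >= end else (beg, end)


def combine_clear_ranges(clv=None, clq=None):
    """Intersect the known vector and quality clear ranges.

    Returns None when neither range is known.
    """
    known = [normalize(r) for r in (clv, clq) if r is not None]
    if not known:
        return None
    beg = max(r[0] for r in known)
    end = min(r[1] for r in known)
    return normalize((beg, end))
```

With clv 10,800 and clq 0,650 the result is 10,650; two ranges that do not overlap give 0,0.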

Linkage Message LKG

The linkage record contains a pair of fragments that are linked. Each fragment will indicate which library it is from, and the library will describe the links (orientation, type) themselves. Fragments must be from the same library to be linked.

{LKG
act:<action>
frg:<fragment-uid>
frg:<fragment-uid>
}
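
A tool that writes LKG messages may want to check the constraints above before handing the file to the assembler. The following Python sketch is one way to do that. The function name check_link and the two lookup dictionaries are hypothetical; they stand for whatever bookkeeping the writer already keeps about FRG and LIB messages.

```python
def check_link(frag_a, frag_b, frag_lib, lib_ori):
    """Return a list of problems with linking two fragments (empty if none).

    frag_lib maps fragment UID -> library UID (from FRG messages).
    lib_ori  maps library UID  -> orientation code (from LIB messages).
    """
    problems = [f"unknown fragment {uid}"
                for uid in (frag_a, frag_b) if uid not in frag_lib]
    if problems:
        return problems

    lib_a, lib_b = frag_lib[frag_a], frag_lib[frag_b]
    if lib_a != lib_b:
        # Fragments must be from the same library to be linked.
        problems.append(f"{frag_a} and {frag_b} are in different libraries")
    elif lib_ori.get(lib_a) == "U":
        # Unoriented libraries do not support mated reads.
        problems.append(f"library {lib_a} is unoriented")
    return problems
```

An empty list means the pair passes these two checks; it does not guarantee that the gatekeeper will accept the link.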

Placement Message PLC

Available in CA 6.0. The placement record constrains a single read to be placed between a pair of fragments. The constraint will be used to place fragments/unitigs for gap filling as well as for surrogate resolution.

{PLC
act:<action>
frg:<constrained fragment-uid>
frg:<border fragment 1 uid>
frg:<border fragment 2 uid>
}

A Short Example FRG file

{VER
ver:2
}
{LIB
act:A
acc:.
ori:I
mea:5821
std:1513
src:
.
nft:0
fea:
.
}
#  This is a comment line
{FRG
act:A
acc:334369678
rnd:1
sta:G
lib:.
pla:0
loc:0
src:
#  Not a comment; this is annotation about where the fragment came from
.
seq:
ATGATCGGCAGTGAATTGTATACGACTCACTATAGGGCGAATTGGAGCTCCACGCGGTGGCGGCCGCTCTAGAACTAGTGGATCCCCCGGGCTGCAGGAA
TTCGATTAGGTGGAGGCCACGCTGCGCGACCCCAGCGCCCAGTCCGTAACGCACGTGCTGCAGGCAGGTGCCGGTCAGTGTGTGTGTGGTGGGGGCGGCG
GCAGGGGGGTTGCGTACAGCATGGTGCTTGAAATTGGAAAGGAAGGAAGTCAGCCGTCAATGGAAGACACGAGTTAGTGCGGGCTTGCCCACATCATTGG
CTGTGTATGGGGGGGGCGGTCATGGCTCAGAACGGAGTGATTACAGGCGCCATAGGCCGCCTGGCACAGCTTGACACAGGAGCACTCCCGCATGCATGCA
CTGTCTCTGTCAGGTGTGACAGAGACAGTGTCACACCTGACATGCCGTGTTGCTCTCCTGTGTGTCCGGTGCCGCAGGAGCGCTCGCGCAAGCTGTCCTC
GGACGTCAGCTCGCTCAAGCGCCAGCTGGGGGAGCGCGACAAGCAGGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGC
GTGTGCGTGCGTGTGTGTGCGTGTGCGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGCGTGAGACGGAAAGAGCCAAG
AAGAGCGCGAACTAAAGGAACAACATGGAAATAGGCGCGGCACCAAAGGTGAACCCTGGGCAACCCCATGGAATCCACAGGGAATCCCGTGTAAACCAAG
GGACCTGAGGAGAGCACCAACAAGATCAGACGANNA
.
qlt:
555566;;;666;;<<<;;<<?CDDDB?<??<<<AADDHHHPVSUUKKG;98:<<>>=???B=;;=>@CDDB?BEDDDIKDVVVKKDDDDDKKKSNNQXP
OLMMMUOPPPSNQJJKKKKKQbXNNPWJJJKKDHEEESYLLFGFFLbb^^^^WWW\\\^\\XXX[NQSYYSSSSSSJJTTT[[dZZZYY[gg[[[[[XXR
[YTGGGGGW`YYYYYRRRRR[YYY[dVdd\YP``PPSMMPPPPMMNSZZ```````\[YYYYdgggggggddgdddbb``gggdbZZZ\gggggggggg`
dddddddd``g`gg`````ggg`g`ggdd````````Z`g``bZZZgggggg`````g````````````Z\\ZZ`d``gg```dgddd````g``gg``
gggggggggd`````dddd``ZZZ``````ddddddddZ``dggg`\ZZZZZ```d`````ZZ`Z\ZZZZ````````````dgg``g``gg[````gdZ
ZZZZZdZZZYY`````gg`gg`````P`ggggg````````[gSXXgg``dVVVYT[][[[XXXggggggg]][ggggggggggg[[[ggggggggZSYY
YYOOOOOO[[[[^^^^^^^^^^VVVQQPSPKKMEDD>DDJDGJEEGJJIDDEEEECAAHFGGJJJJLPLL<<;<<HE@::88786666667866667966
6666877778744696657544466664546699877766667667<<766766778888866666789988868666886666666866677787778<
9:99:8876666678776667666669987575005
.
hps:
.
clr:0,836
}
{FRG
act:A
acc:334370061
rnd:1
sta:G
lib:.
pla:0
loc:0
src:
.
seq:
ACTCAGCCTAAATACCTCACTAAGGGAACAAAGCTGGTACGGGCCCCCCCTCGAGGTCGACGGTATCGATAAGCTTGATCGGCTGGTCCCATTCGCCTTC
CCATTCCAATTCCCGTATTCCCATCCCCACTCCGATCCCCATTCGCAGATTCCCATTCCCATATTCACCATTCCCAGCCCCAGGCCACGCACCAGCGAGC
CCGAGAGCTCCGGCAGCAGCAGCGCAGCGGAGCCGCTCGGCGACATCCCCGCCGCCGCCCCGCCCAGCAGCTGCGACTGCGACGGCTGCGAGCCCGAGCT
CGAGCCCGTGAAGCCGCCTCCCGCCGCCGCAGCCGCGCCCCGCCCGCCTCCTCCGCCTCCGCCTGCGCCTCCGCCGGTGGCGTGCGTGGCTGCTGCTGTG
GCGAGATGCTCCTCCAGCTGCGCCACCAGCTGTGCCCGGTGCGCCAGGTCCGACTCCAGCGCCCGGATCTTGGAGCCCAGCTCGCCGATCTGCGGCGTGG
AGCCGTGGGTTGGTTGCGCGGTCCTCAGGGTCCCGTGGGGGTGATCAGTTGCATACCCGTGGGGATGCCATGGGGGATGGCGCAGGGTTCGACCGTGTGG
AGGGCGGGCGCAGAACCAGGGCGCAGGCACTAAGGCGCGCGCATCATGGGN
.
qlt:
6689;;6687;>BG>?<??;:9??>NL?;::?9><??<??<::???G@C>888;;AGGGHKKKKKKHHKKKKPCCCCCASK=C=??COM[[bQS]bbbUU
UbbbbbGGCCCCCCFLCFKKFFMSSSbbVVVVKGGGGGOOOOOMUUVVIIIIGGMMMKIKLULIKLbGGLLKKMMMUUUVSVSKKMVVNNNNNNNNSKKG
KHHNNNGKKKKKSVVVSSS\\VVVXVVVV\\VQKHHHHHNSSRGGGGGKQJDD<;ADBEHJHMPWSSSUUUSSSSVVSSPSXVVKLBJ@JJQXXSQNVbV
NNNNNURQOOHGGCBAA?DKGG?K?GEEJIGC===@@NSKJ=<=B@DDR[\VVNKMMSSVLKNNKQQSWWOOGGEGGDDDDDGVNSSSNKKNNNSVNNSV
VVPOOSUSUUV[[WSSSNSKQJGGEEEGGNGJHQMOOUUUUUQUUNSVSKPKKKVVSQQVVV\XXSSRXVbbVVV\bVSSVSSSUUVUUVVUUUPOOKGE
EEEEEEEEIFHD==?BBDGNOUOVKEAAADDDDEEGGFJIGGJHJGJKMLMJHHKKKNOLVJGB=>@@@>EEEIBIIJMGG><778>ADFJJLLGCCA@>
>==BDGGGG??B===@A>==@??<<<<<<<;999;;BBBBBBBBGB=4440
.
hps:
.
clr:0,651
}
{LKG
act:A
frg:334370061
frg:334369678
}
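
A file like the example above can be split into messages with a short Python sketch. This is a simplified reader written for illustration, not the gatekeeper's parser, and the function name parse_frg is hypothetical. It assumes that a field with nothing after the colon is a multi-line field ended by a lone period, and it returns fields as (tag, value) pairs because a tag such as frg can repeat within one message.

```python
def parse_frg(text):
    """Yield (message_type, fields) for each message in an FRG file.

    fields is a list of (tag, value) pairs; the value of a multi-line
    field is the list of its lines.
    """
    lines = iter(text.splitlines())
    for line in lines:
        if line.startswith("#"):
            continue                          # comment between messages
        if not line.startswith("{"):
            raise ValueError(f"expected message start, got {line!r}")
        kind = line[1:].strip()
        fields = []
        for line in lines:
            if line.startswith("}"):
                break                         # end of this message
            if line.startswith("#"):
                continue                      # comment between fields
            tag, sep, value = line.partition(":")
            if not sep:
                raise ValueError(f"expected tag:value, got {line!r}")
            if value.strip():
                fields.append((tag, value.strip()))
                continue
            body = []
            for line in lines:
                if line.strip() == ".":
                    break                     # end of multi-line field
                if tag == "fea" and line.startswith("#"):
                    continue                  # comments allowed in fea lists
                body.append(line)
            fields.append((tag, body))
        yield kind, fields
```

Applied to the example file, this yields one VER, one LIB, two FRG and one LKG message. A blank line raises ValueError, since blank lines are not valid in FRG files.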

Peculiarities

No blank lines. Blank lines are invalid in FRG files. Users have reported that the gatekeeper crashes on such files.
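
A quick pre-check for blank lines can be sketched in Python before running the gatekeeper. The function name find_blank_lines is hypothetical, and treating white-space-only lines as blank is an assumption of this sketch.

```python
def find_blank_lines(lines):
    """Return the 1-based numbers of blank (or white-space-only) lines."""
    return [number for number, line in enumerate(lines, start=1)
            if not line.strip()]
```

Passing an open file object works as well as a list of strings, since both yield one line at a time.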