FastqToCA

From wgs-assembler

The native file format for Illumina data is FASTQ. The FASTQ format encodes read identifiers, read sequences, and sequence quality values. The format of the read IDs can encode the mate pairing of reads, but not any information about the orientation or expected distance between the mated reads. Starting with version 6.1, the Celera Assembler can read most variants of FASTQ files.

To provide library information to Celera Assembler, we wrap a set of FASTQ files in a Celera Assembler FRG format LIB message.

Celera Assembler includes a utility, fastqToCA, to generate wrapper LIB messages for FASTQ files. Other utilities include fastqSample, which randomly samples a FASTQ file to reduce coverage, and fastqSimulate, which generates simulated Illumina SE, PE or MP reads from a reference sequence.

Usage

usage: fastqToCA [-insertsize <mean> <stddev>] [-libraryname <name>]

  -insertsize i d    Mates are on average i +- d bp apart.
                     If the word 'constant' follows the insert size, no changes will be
                     made to the insert size.

  -libraryname n     The UID of the library these reads are added to.

  -technology p      What instrument were these reads generated on ('illumina' is the default):
                       'none'               -- don't set any features; use -feature to set them manually
                       'sanger'             -- reads from dideoxy sequencers
                       '454'                -- reads from 454 Life Sciences; FLX, Titanium, FLX+
                       'illumina'           -- reads from Illumina; GAIIx, MiSeq, HiSeq; shorter than 160bp
                       'illumina-long'      -- reads from Illumina; GAIIx, MiSeq, HiSeq; any length
                       'moleculo'           -- reads from Illumina; Moleculo
                       'pacbio-ccs'         -- reads from PacBio; Circular Consensus Sequence (CSS)
                       'pacbio-corrected'   -- reads from PacBio; corrected reads from pacBioToCA
                       'pacbio-raw'         -- reads from PacBio; uncorrected reads

  -type t            What type of fastq ('sanger' is the default):
                       'sanger'   -- QV's are PHRED, offset=33 '!', NCBI SRA data.
                       'solexa'   -- QV's are Solexa, early Solexa data.
                       'illumina' -- QV's are PHRED, offset=64 '@', Illumina reads from version 1.3 on.
                     See Cock, et al., 'The Sanger FASTQ file format for sequences with quality scores, and
                     the Solexa/Illumina FASTQ variants', doi:10.1093/nar/gkp1137

  -innie             The paired end reads are 5'-3' <-> 3'-5' (the usual case) (default)

  -outtie            The paired end reads are 3'-5' <-> 5'-3' (for Illumina Mate Pair reads)
                     This switch will reverse-complement every read, transforming outtie-oriented
                     mates into innie-oriented mates.  This trick only works if all reads are the
                     same length.

  -reads A           Single ended reads, in fastq format.
  -mates A           Mated reads, interlaced, in fastq format.
  -mates A,B         Mated reads, in fastq format.

Library Features

  -nonrandom         Mark the library as containing non-random reads.
  -feature F V       Set feature F to value V.


The -insertsize option is optional. If it is supplied, the reads are treated as mated; if it is omitted, they are treated as unmated.

The -libraryname option is mandatory. It provides the UID of the library that the reads are placed into. The name is entirely up to the user, but it must not contain whitespace or commas.

The -technology option is optional. It selects the platform the reads were generated on. This sets library feature flags to enable different correction, trimming and unitigging algorithms. The default is 'illumina'.

The supported technologies are:

  • none: No features set. Use -feature to set them manually.
  • sanger: Reads from dideoxy sequencers.
  • 454: Reads from 454 Life Sciences; FLX, Titanium, FLX+.
  • illumina: Reads from Illumina; GAIIx, MiSeq, HiSeq; shorter than 160bp. This enables a slightly more efficient storage scheme in the gkpStore.
  • illumina-long: Reads from Illumina; GAIIx, MiSeq, HiSeq; any length.
  • moleculo: Reads from Illumina; Moleculo.
  • pacbio-ccs: Reads from PacBio; Circular Consensus Sequence (CCS).
  • pacbio-corrected: Reads from PacBio; corrected reads output from pacBioToCA.
  • pacbio-raw: Reads from PacBio; uncorrected reads.

The -type option is optional. It selects the QV encoding used in the fastq file; the default is 'sanger'. See Cock et al., "The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants" (doi:10.1093/nar/gkp1137) for details. See also the Wikipedia article on the FASTQ format, which covers QV encoding.
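To make the offsets concrete, here is a minimal Python sketch that re-encodes a PHRED quality string from one ASCII offset to another. The offsets 33 and 64 come from the -type descriptions in the usage text. The offset 48 ('0') is an assumption inferred from the gatekeeper dumps in the examples on this page, where an input 'g' appears as 'W'. The function names are hypothetical, and the 'solexa' type is not handled because Solexa QVs are not on the PHRED scale.

```python
# Sketch: re-encode FASTQ quality strings between ASCII offsets.
# Offsets 33 and 64 are taken from the -type usage text; offset 48 is an
# assumption inferred from the fragmentQuality lines shown on this page.

SANGER_OFFSET = 33    # '!'  (-type sanger)
ILLUMINA_OFFSET = 64  # '@'  (-type illumina, Illumina 1.3 and later)
STORE_OFFSET = 48     # '0'  (assumed encoding of the dumped QVs)


def decode_qvs(quality, offset):
    """Return the integer quality values encoded in a quality string."""
    qvs = [ord(ch) - offset for ch in quality]
    if min(qvs, default=0) < 0:
        raise ValueError("character below offset %d; wrong -type?" % offset)
    return qvs


def encode_qvs(qvs, offset):
    """Encode integer quality values as characters at the given offset."""
    return "".join(chr(qv + offset) for qv in qvs)


def reencode(quality, from_offset, to_offset):
    """Convert a quality string from one ASCII offset to another."""
    return encode_qvs(decode_qvs(quality, from_offset), to_offset)
```

A negative decoded value is a quick hint that the wrong -type was chosen: a Sanger-encoded file read as 'illumina' will usually contain characters below '@'.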

One of -innie (default) or -outtie may be supplied to indicate the orientation of mated reads. The 'paired-end' protocol generates innie reads, typically with an insert size of a few hundred bases. The 'mate-pair' protocol generates outtie reads, with an insert size of a few thousand bases.
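The usage text says that -outtie reverse-complements every read to turn outtie-oriented mates into innie-oriented mates. The Python sketch below shows that operation on a single read. It illustrates the transformation only and is not the assembler's own code; the function name is hypothetical, and reversing the quality string alongside the bases is an assumption about what a consistent implementation would do.

```python
# Sketch: reverse-complement one read, keeping quality values aligned
# with their bases. Illustrative only; not taken from the assembler source.

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement_read(sequence, quality):
    """Return (sequence, quality) for the opposite strand of a read."""
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    # Complement each base, then reverse; the quality string is only reversed.
    return sequence.translate(_COMPLEMENT)[::-1], quality[::-1]
```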

There are three options for specifying the FASTQ files.

  • For single-ended reads: '-reads FILE'
  • For paired-end or mate-pair reads interleaved in one file: '-mates FILE'. We assume that the 1st and 2nd reads are mated, that the 3rd and 4th reads are mated, and so on.
  • For paired-end or mate-pair reads synchronized between two files: '-mates FILE1,FILE2'. We assume that the 1st read in each file is a pair, that the 2nd read in each file is a pair, and so on.

In all cases, we do NOT use the read name for anything. 'gatekeeper' will generate a mapping of read name to assembly ID in the file 'gkpStore.fastqUIDmap'.
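The positional pairing rules above can be sketched in Python as follows. This is an illustration of the stated convention, not the gatekeeper implementation. It assumes four lines per FASTQ record (no wrapped sequences), and treating an odd read count or files of unequal length as an error is an assumption; the function names are hypothetical.

```python
# Sketch: pair reads by position only, never by read name.
from itertools import zip_longest


def read_fastq(handle):
    """Yield (name, sequence, quality) from a four-line-per-record stream."""
    while True:
        header = handle.readline()
        if not header.strip():
            return  # end of file, or a blank line treated as the end
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header.rstrip("\n")[1:], sequence, quality


def pairs_from_interleaved(handle):
    """Pair the 1st and 2nd reads, the 3rd and 4th, and so on."""
    records = read_fastq(handle)
    for first in records:
        second = next(records, None)
        if second is None:
            raise ValueError("odd number of reads in interleaved file")
        yield first, second


def pairs_from_two_files(handle_a, handle_b):
    """Pair the Nth read of one file with the Nth read of the other."""
    for first, second in zip_longest(read_fastq(handle_a), read_fastq(handle_b)):
        if first is None or second is None:
            raise ValueError("files hold different numbers of reads")
        yield first, second
```

Running a check like this before the assembly can catch files that were filtered independently and are no longer synchronized.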

NOTE: The read data in the FASTQ files is NOT copied into the output FRG wrapper; the wrapper only records the paths to the files. It is therefore critical not to rename or move the input FASTQ files before the assembly starts.
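Because the wrapper stores only file paths, a quick pre-flight check can confirm that every referenced FASTQ file still exists. The Python sketch below relies on the fastqReads= and fastqMates= feature lines that appear in the example FRG files on this page. It assumes that paths contain no commas, and the function names are hypothetical.

```python
# Sketch: list the FASTQ files an FRG wrapper points at and report any that
# are missing. Assumes the fastqReads=/fastqMates= line layout shown in the
# examples on this page, and that file paths contain no commas.
import os


def fastq_paths_in_frg(frg_text):
    """Return the FASTQ paths named by fastqReads= and fastqMates= lines."""
    paths = []
    for line in frg_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key in ("fastqReads", "fastqMates"):
            paths.extend(p for p in value.split(",") if p)
    return paths


def missing_fastq_files(frg_text):
    """Return the referenced FASTQ paths that do not exist on disk."""
    return [p for p in fastq_paths_in_frg(frg_text) if not os.path.isfile(p)]
```

For example, `missing_fastq_files(open("ab.frg").read())` returning an empty list suggests the wrapper is still valid.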

If -nonrandom is supplied, the reads in this library are marked as being from a non-random sampling of the genome, and will not be used during the coverage-based repeat calculation.

Explicit library features can be set with '-feature'.
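When many libraries need wrapping, it can help to assemble the command line programmatically. The Python sketch below builds an argument list using only the options documented in the usage text above. The helper name and its parameters are hypothetical, and it does not run fastqToCA itself.

```python
# Sketch: build a fastqToCA argument list from the documented options.
# The helper and its parameter names are illustrative, not part of the tool.


def fastq_to_ca_args(libraryname, reads=None, mates=None, insertsize=None,
                     technology=None, qv_type=None, outtie=False,
                     nonrandom=False, features=None):
    """Return an argv-style list for one fastqToCA invocation."""
    if not libraryname or any(ch.isspace() or ch == "," for ch in libraryname):
        raise ValueError("library name must be non-empty, "
                         "with no whitespace or commas")
    args = ["fastqToCA", "-libraryname", libraryname]
    if technology is not None:
        args += ["-technology", technology]
    if qv_type is not None:
        args += ["-type", qv_type]
    if insertsize is not None:
        mean, stddev = insertsize
        args += ["-insertsize", str(mean), str(stddev)]
    if outtie:
        args.append("-outtie")
    if nonrandom:
        args.append("-nonrandom")
    for name, value in (features or {}).items():
        args += ["-feature", name, str(value)]
    if reads is not None:
        args += ["-reads", reads]
    if mates is not None:
        # One file means interleaved mates; two files are joined by a comma.
        args += ["-mates", ",".join(mates)]
    return args
```

The list can be passed to `subprocess.run`, with standard output redirected to the .frg file, since fastqToCA writes the wrapper to standard output in the examples on this page.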

Example with Unmated Reads

Suppose we have two unmated reads in a single fastq file.

@E100EAS20:6:1:11:652#0/1
ATTGAAGAACGCGAGGCATCGTCTTAACGAGGCACCGAGGCGTCGCATTCTTCAGATGGTTCAACCCTTAAGTTAGCGCTTATGGGAGTAATCCCCGCAT
+
ggggggggffgggdgeeggfggfggggggfgggggdgdeegdeggfgggheggfgeeegdegecf`gggecgfgde\ebedfgg]cdeddcdKca^^YV`
@E100EAS20:6:1:11:112#0/1
CATTAGCGATCATCTCGATCTGTTAGCCAATCACGACTTCCGCACTTTAATGCGCGTCACGCGTCTGAAAGAAGATGTGCTGAAAGAAGCCGTCAATCTG
+
gggfgggffghggfggg_ggggggchgdgfdgggecfggggggggggghdgggghghggfeeg`dce`_Jb^T_]afff]ff_fbfdYfYad^ce[c\c^

Convert these to CA frg format using:

% fastqToCA -libraryname UNMATED -type illumina -reads unmated.fastq > unmated.frg
% cat unmated.frg
{VER
ver:2
}
{LIB
act:A
acc:UNMATED
ori:U
mea:0.000
std:0.000
src:
.
nft:20
fea:
forceBOGunitigger=1
isNotRandom=0
doNotTrustHomopolymerRuns=0
doTrim_initialNone=0
doTrim_initialMerBased=1
doTrim_initialFlowBased=0
doTrim_initialQualityBased=0
doRemoveDuplicateReads=1
doTrim_finalLargestCovered=1
doTrim_finalEvidenceBased=0
doTrim_finalBestEdge=0
doRemoveSpurReads=1
doRemoveChimericReads=1
doCheckForSubReads=0
doConsensusCorrection=0
forceShortReadFormat=1
constantInsertSize=0
fastqQualityValues=illumina
fastqOrientation=innie
fastqReads=/work/wgs-wiki/x/unmated.fastq
.
}
{VER
ver:1
}

We can load this into a gatekeeper store then dump the store to verify the reads are present:

% gatekeeper -T -o unmated.gkpStore unmated.frg

Starting file 'unmated.frg'.

Processing SINGLE-ENDED SANGER QV encoding reads from:
      '/work/wgs-wiki/x/unmated.fastq'


GKP finished with no alerts or errors.

% gatekeeper -dumpfragments -withsequence unmated.gkpStore
fragmentIdent           = 100000000001,1
fragmentMate            = 0,0
fragmentLibrary         = UNMATED,1
fragmentIsDeleted       = 0
fragmentIsNonRandom     = 0
fragmentOrientation     = U
fragmentSeqLen          = 100
fragmentClear           = 0,100
fragmentClear           = LATEST,0,100
fragmentClear           = CLR,0,100
fragmentSequence      = ATTGAAGAACGCGAGGCATCGTCTTAACGAGGCACCGAGGCGTCGCATTCTTCAGATGGTTCAACCCTTAAGTTAGCGCTTATGGGAGTAATCCCCGCAT
fragmentQuality       = WWWWWWWWVVWWWTWUUWWVWWVWWWWWWVWWWWWTWTUUWTUWWVWWWXUWWVWUUUWTUWUSVPWWWUSWVWTULURUTVWWMSTUTTST;SQNNIFP
fragmentSeqOffset       = 0
fragmentQltOffset       = 0
fragmentIdent           = 100000000002,2
fragmentMate            = 0,0
fragmentLibrary         = UNMATED,1
fragmentIsDeleted       = 0
fragmentIsNonRandom     = 0
fragmentOrientation     = U
fragmentSeqLen          = 100
fragmentClear           = 0,100
fragmentClear           = LATEST,0,100
fragmentClear           = CLR,0,100
fragmentSequence      = CATTAGCGATCATCTCGATCTGTTAGCCAATCACGACTTCCGCACTTTAATGCGCGTCACGCGTCTGAAAGAAGATGTGCTGAAAGAAGCCGTCAATCTG
fragmentQuality       = WWWVWWWVVWXWWVWWWOWWWWWWSXWTWVTWWWUSVWWWWWWWWWWWXTWWWWXWXWWVUUWPTSUPO:RNDOMQVVVMVVOVRVTIVIQTNSUKSLSN
fragmentSeqOffset       = 0
fragmentQltOffset       = 0

% cat unmated.gkpStore.fastqUIDmap
100000000001    1       E100EAS20:6:1:11:652#0/1
100000000002    2       E100EAS20:6:1:11:112#0/1


Example with Mated Reads

In this example the mated reads are in two files. Corresponding fragments in each file are mated: the first fragment in file A is mated to the first fragment in file B, and so on.

The first file contains one fragment:

@E100EAS20:6:1:6:1021#0/1
GTATTTTCAAGCCTGGCTTGTTGCAAACAATGTATAAAGCACTTAGGCAATAATAATTACATTCAGCAACTATCATCATCGGTATTGTTTGTGGGCGGAA
+
ggfggggggfcgagggegggfgffgdg_efgcedgfgegdfdggffgfdgggfggfgfffgggdegffbfgghefheege^f`behfeegffgafVg_dZ

The second file contains the mate:

@E100EAS20:6:1:6:1021#0/2
TACTACTAATTCTCAAATAGTCTTTTTCCATAAAGCTACACCAATCTGTAGGTTGTAGATCCCTTTCTATTATTAATAGATATAAGAACTGTACTGTTGT
+
eggggfggdgggggggfgfgdgggggffghgggggg`gfggggffggg_gdVfcgfd_\gggggeeegdgegg`gaggddafggfdbcggg`edggbfff

To convert these to CA frg format, the "-insertsize" option is needed, and the "-mates" option takes both fastq files separated by a comma:

% fastqToCA -libraryname MATED -type illumina -insertsize 200 20 -mates a.fastq,b.fastq > ab.frg
% cat ab.frg
{VER
ver:2
}
{LIB
act:A
acc:MATED
ori:I
mea:200.000
std:20.000
src:
.
nft:20
fea:
forceBOGunitigger=1
isNotRandom=0
doNotTrustHomopolymerRuns=0
doTrim_initialNone=0
doTrim_initialMerBased=1
doTrim_initialFlowBased=0
doTrim_initialQualityBased=0
doRemoveDuplicateReads=1
doTrim_finalLargestCovered=1
doTrim_finalEvidenceBased=0
doTrim_finalBestEdge=0
doRemoveSpurReads=1
doRemoveChimericReads=1
doCheckForSubReads=0
doConsensusCorrection=0
forceShortReadFormat=1
constantInsertSize=0
fastqQualityValues=illumina
fastqOrientation=innie
fastqMates=/work/wgs-wiki/x/a.fastq,/work/wgs-wiki/x/b.fastq
.
}
{VER
ver:1
}
% gatekeeper -T -o ab.gkpStore ab.frg

Starting file 'ab.frg'.

Processing INNIE SANGER QV encoding reads from:
      '/work/wgs-wiki/x/a.fastq'
  and '/work/wgs-wiki/x/b.fastq'


GKP finished with no alerts or errors.

% gatekeeper -dumpfragments -withsequence ab.gkpStore

fragmentIdent           = 110000000001,1
fragmentMate            = 120000000001,2
fragmentLibrary         = MATED,1
fragmentIsDeleted       = 0
fragmentIsNonRandom     = 0
fragmentOrientation     = I
fragmentSeqLen          = 100
fragmentClear           = 0,100
fragmentClear           = LATEST,0,100
fragmentClear           = CLR,0,100
fragmentSequence      = GTATTTTCAAGCCTGGCTTGTTGCAAACAATGTATAAAGCACTTAGGCAATAATAATTACATTCAGCAACTATCATCATCGGTATTGTTTGTGGGCGGAA
fragmentQuality       = WWVWWWWWWVSWQWWWUWWWVWVVWTWOUVWSUTWVWUWTVTWWVVWVTWWWVWWVWVVVWWWTUWVVRVWWXUVXUUWUNVPRUXVUUWVVWQVFWOTJ
fragmentSeqOffset       = 0
fragmentQltOffset       = 0
fragmentIdent           = 120000000001,2
fragmentMate            = 110000000001,1
fragmentLibrary         = MATED,1
fragmentIsDeleted       = 0
fragmentIsNonRandom     = 0
fragmentOrientation     = I
fragmentSeqLen          = 100
fragmentClear           = 0,100
fragmentClear           = LATEST,0,100
fragmentClear           = CLR,0,100
fragmentSequence      = TACTACTAATTCTCAAATAGTCTTTTTCCATAAAGCTACACCAATCTGTAGGTTGTAGATCCCTTTCTATTATTAATAGATATAAGAACTGTACTGTTGT
fragmentQuality       = UWWWWVWWTWWWWWWWVWVWTWWWWWVVWXWWWWWWPWVWWWWVVWWWOWTFVSWVTOLWWWWWUUUWTWUWWPWQWWTTQVWWVTRSWWWPUTWWRVVV
fragmentSeqOffset       = 0
fragmentQltOffset       = 0

% cat ab.gkpStore.fastqUIDmap
110000000001    1       E100EAS20:6:1:6:1021#0/1        120000000001    2       E100EAS20:6:1:6:1021#0/2
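The fastqUIDmap files shown in both examples can be read back to translate FASTQ read names into assembly IDs. The Python sketch below assumes the layout seen on this page: whitespace-separated columns of UID, IID and read name, with the three columns repeated for the mate on the same line. It also assumes that read names contain no whitespace, and the function name is hypothetical.

```python
# Sketch: parse a gkpStore.fastqUIDmap file into a lookup table.
# Assumes the column layout shown in the examples on this page and that
# read names contain no whitespace.


def parse_fastq_uid_map(text):
    """Map each FASTQ read name to its (UID, IID) pair."""
    name_to_ids = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) % 3 != 0:
            raise ValueError("unexpected column count: %r" % line)
        # Unmated lines hold one (UID, IID, name) triple; mated lines hold two.
        for i in range(0, len(fields), 3):
            uid, iid, name = fields[i:i + 3]
            name_to_ids[name] = (int(uid), int(iid))
    return name_to_ids
```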