Escherichia coli K12 MG1655, using uncorrected PacBio reads, with CA8.2

From wgs-assembler
Jump to: navigation, search


CA 8 is able to assemble PacBio filtered_subreads directly, without correction.

Download and Prepare the reads

There are two datasets for Escherichia coli K12 MG1655.

Our friends at NBACC released the e.coli reads used in Koren et al. as NCBI SRA project SRP020003 contains reads generated using 454 FLX Titanium, PacBio RS C2 CCS and PacBio RS C2. The full data set is 600x coverage (11.4 GB), and the reads are (slightly) older technology, so we'll not use them in this example. If we were to use them, we could filter out all the short reads, or sample 100x randomly from all files, or just use one of the libraries.

Pacific Biosciences released 100x of P4-C2 chemistry reads. You can download them directly (6.35 GB) or from the original page. You must have the Pac Bio SMRTpipe software installed to extract the reads as FASTQ.

You can grab the three FASTQ files we made:

or use the following impressively long curl commands:

curl -L -o escherichia_coli_k12_mg1655.m130404_014004_sidney_c100506902550000001823076808221337_s1_p0.1.fastq.xz http://sourceforge.net/projects/wgs-assembler/files/wgs-assembler/wgs-8.0/datasets/escherichia_coli_k12_mg1655.m130404_014004_sidney_c100506902550000001823076808221337_s1_p0.1.fastq.xz/download
curl -L -o escherichia_coli_k12_mg1655.m130404_014004_sidney_c100506902550000001823076808221337_s1_p0.2.fastq.xz http://sourceforge.net/projects/wgs-assembler/files/wgs-assembler/wgs-8.0/datasets/escherichia_coli_k12_mg1655.m130404_014004_sidney_c100506902550000001823076808221337_s1_p0.2.fastq.xz/download
curl -L -o escherichia_coli_k12_mg1655.m130404_014004_sidney_c100506902550000001823076808221337_s1_p0.3.fastq.xz http://sourceforge.net/projects/wgs-assembler/files/wgs-assembler/wgs-8.0/datasets/escherichia_coli_k12_mg1655.m130404_014004_sidney_c100506902550000001823076808221337_s1_p0.3.fastq.xz/download

Once downloaded, a FRG file wrapper can be generated with fastqToCA without uncompressing the files.

fastqToCA \
  -libraryname ecoli \
  -technology pacbio-raw \
  -reads *1.fastq.xz \
  -reads *2.fastq.xz \
  -reads *3.fastq.xz \
> ecoli-untrimmed.frg

Overlap Based Trimming

Compile with AS_READ_MAX_NORMAL_LEN_BITS=15

Create a file ecoli-trim.spec with the following runCA configuration:

merSize               = 19
merThreshold          = 0
merDistinct           = 0.9995
merTotal              = 0.995

doOBT                 = 1
stopAfter             = obt

ovlErrorRate          = 0.40  #  Compute overlaps up to 40% error
obtErrorRate          = 0.40  #  Trim reads using overlaps up to 40% error.  New in CA 8!
cnsErrorRate          = 0.40  #  Needed to allow ovlErrorRate=0.40
cgwErrorRate          = 0.40  #  Needed to allow ovlErrorRate=0.40

ovlConcurrency        = 16
cnsConcurrency        = 4

ovlThreads            = 1
ovlHashBits           = 22
ovlHashBlockLength    = 10000000
ovlRefBlockSize       = 25000

This configuration assumes you have a machine with 16 processors, and at least 32gb memory. RunCA will create 118 overlap jobs that require about 1gb memory each, will run 16 in parallel, and will require approximately 150 CPU hours. In the CA 8.1 example, we used 17-mers; here we're using 19-mers.

runCA -p ecoli-trim -d ecoli-trim -s ecoli-trim.spec ecoli-untrimmed.frg

Since the merDistinct and merTotal are new to CA8, the runCA stdout chatter reports the actual merThreshold used:

Supplied merDistinct 0.9995 with threshold 22 is the smallest.

Read Extraction

Trimming is finished if the file 'ecoli-trim/0-overlaptrim/overlaptrim.success' exists. The trimmed reads are still stored in the gkpStore. A few commands will extract them, and replace the CA supplied UID with the original name.

Once the reads are extracted, you could filter out the short reads. Because we still have about 85x coverage, we'll sample down to the longest 25x using fastq-extractLongestCoverage.pl.

WARNING: This only works if there is exactly one library. All reads are extracted into one set of FASTQ files. In CA 8.2, you can use the -withlibname option to gatekeeper to include the library name in the output file name, and so can dump stores with multiple libraries.

NOTE: The CA 8.1 program replaceUIDwithName was renamed to replaceUIDwithName-fastq.

% cd ecoli-trim

% gatekeeper -dumpfastq ecoli-trim ecoli-trim.gkpStore
Scanning store to find libraries used and reads to dump.
Added 0 reads to maintain mate relationships.
Dumping 0 fragments from unknown library (version 1 has these)
Dumping 91388 fragments from library IID 1

% ls -l ecoli-trim.*.fastq
-rw-r--r--  1 bri  staff  775141254 Nov 11 12:40 ecoli-trim.unmated.fastq

% replaceUIDwithName-fastq ecoli-trim.gkpStore.fastqUIDmap ecoli-trim.unmated.fastq 
Renaming 'ecoli-trim.unmated.fastq' to 'ecoli-trim.unmated.fastq.RENAMING'.
Renamed 91388 reads using 91394 names.

% ls -l ecoli-trim.*.fastq
-rw-r--r--  1 bri  staff  775141254 Nov 11 12:40 ecoli-trim.unmated.CA_UIDs.fastq
-rw-r--r--  1 bri  staff  776873478 Nov 11 12:43 ecoli-trim.unmated.fastq

% head -n 1 *fastq
==> ecoli-trim.unmated.CA_UIDs.fastq <==
@100000000001,1 clr=0,551 clv=1,0 max=1,0 tnt=1,0 rnd=t

==> ecoli-trim.unmated.fastq <==
@m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/25/0_593

% fastq-extractLongestCoverage.pl 4639675 25 ecoli-trim.unmated.fastq > ecoli-trim.unmated.longest25x.fastq
sumLength = 115994184
minLength = 8616 (with 10502 reads)
tgtLength = 115991875

% fastqToCA -libraryname ecoli-trimmed-25x -technology pacbio-raw -reads ecoli-trim.unmated.longest25x.fastq > ecoli-trim.unmated.longest25x.frg

Assembly

Create a file ecoli-assembly.spec with the following runCA configuration:

merSize               = 19
merThreshold          = 0
merDistinct           = 0.9995
merTotal              = 0.995

doOBT                 = 0
doExtendClearRanges   = 0

unitigger             = bogart

ovlErrorRate          = 0.35  #  Compute overlaps up to 35% error
utgGraphErrorRate     = 0.35  #  Unitigs at 35% error
utgMergeErrorRate     = 0.35  #  Unitigs at 35% error
cnsErrorRate          = 0.35  #  Needed to allow ovlErrorRate=0.35
cgwErrorRate          = 0.35  #  Needed to allow ovlErrorRate=0.35

ovlConcurrency        = 16
cnsConcurrency        = 16

ovlThreads            = 1
ovlHashBits           = 22
ovlHashBlockLength    = 10000000
ovlRefBlockSize       = 25000

#cnsReduceUnitigs      = 0 0   #  Always use only uncontained reads for consensus
cnsReuseUnitigs       = 1     #  With no mates, no need to redo consensus

cnsMinFrags           = 1000
cnsPartitions         = 256

cnsMaxCoverage        = 10    #  Limit consensus calling to 10x

and then launch the assembly. As was done for trimming above, this is also using 19-mers, compared to the CA 8.1 example where we used 17-mers.

To speed up consensus calling, at expense of quality, cnsMaxCoverage will sample reads to achieve approximately 10x coverage.

runCA -p ecoli-assembly -d ecoli-assembly -s ecoli-assembly.spec ecoli-trim/ecoli-trim.unmated.longest25x.frg

By selecting only the longest 25x of reads, only 12 overlap jobs are created. As during trimming, only 1gb memory is needed for each job. The overlaps need approximately 20 CPU hours.

Unitig consensus requires 16gb memory (8gb might be sufficient) and 11 CPU hours.

Results

% cd ecoli-assembly
% dotplot.sh CTG NC_000913.fasta 9-terminator/ecoli-assembly.ctg.fasta 
% mergeqc.pl -wiki *qc > qc.wiki

Contigs

Files
ecoli-assembly.qc
Scaffolds
TotalScaffolds 1
TotalContigsInScaffolds 1
MeanContigsPerScaffold 1.00
MinContigsPerScaffold 1
MaxContigsPerScaffold 1
TotalBasesInScaffolds 4737179
MeanBasesInScaffolds 4737179
MinBasesInScaffolds 4737179
MaxBasesInScaffolds 4737179
N25ScaffoldBases 4737179
N50ScaffoldBases 4737179
N75ScaffoldBases 4737179
ScaffoldAt1000000 4737179
TotalSpanOfScaffolds 4737179
MeanSpanOfScaffolds 4737179
MinScaffoldSpan 4737179
MaxScaffoldSpan 4737179
IntraScaffoldGaps 0
2KbScaffolds 1
2KbScaffoldSpan 4737179
MeanSequenceGapLength 0
Top5Scaffolds=contigs,size,span,avgContig,avgGap,EUID
0 1,4737179,4737179,4737179,0,100000010510
total 1,4737179,4737179,4737179,0
Contigs
TotalContigsInScaffolds 1
TotalBasesInScaffolds 4737179
TotalVarRecords 0
MeanContigLength 4737179
MinContigLength 4737179
MaxContigLength 4737179
N25ContigBases 4737179
N50ContigBases 4737179
N75ContigBases 4737179
ContigAt1000000 4737179
BigContigs_greater_10000
TotalBigContigs 1
BigContigLength 4737179
MeanBigContigLength 4737179
MinBigContig 4737179
MaxBigContig 4737179
BigContigsPercentBases 100.00
SmallContigs
TotalSmallContigs 0
SmallContigLength 0
MeanSmallContigLength 0
MinSmallContig 0
MaxSmallContig 0
SmallContigsPercentBases 0.00
DegenContigs
TotalDegenContigs 0
DegenContigLength 0
DegenVarRecords 0
MeanDegenContigLength 0
MinDegenContig 0
MaxDegenContig 0
DegenPercentBases 0.00
Top5Contigs=reads,bases,EUID
0 10408,4737179,100000010509
total 10408,4737179
UniqueUnitigs
TotalUUnitigs 1
MinUUnitigLength 6478698
MaxUUnitigLength 6478698
MeanUUnitigLength 6478698
SDUUnitigLength 0
Surrogates
TotalSurrogates 0
SurrogateInstances 0
SurrogateLength
SurrogateInstanceLength 0
UnPlacedSurrReadLen 0
PlacedSurrReadLen 0
MinSurrogateLength 0
MaxSurrogateLength 0
MeanSurrogateLength 0
SDSurrogateLength 0
Mates
ReadsWithNoMate 10507(100.00%)
ReadsWithGoodMate 0(0.00%)
ReadsWithBadShortMate 0(0.00%)
ReadsWithBadLongMate 0(0.00%)
ReadsWithSameOrientMate 0(0.00%)
ReadsWithOuttieMate 0(0.00%)
ReadsWithBothChaffMate 0(0.00%)
ReadsWithChaffMate 0(0.00%)
ReadsWithBothDegenMate 0(0.00%)
ReadsWithDegenMate 0(0.00%)
ReadsWithBothSurrMate 0(0.00%)
ReadsWithSurrogateMate 0(0.00%)
ReadsWithDiffScafMate 0(0.00%)
ReadsWithUnassignedMate 0(0.00%)
TotalScaffoldLinks 0
MeanScaffoldLinkWeight 0.00
Reads
TotalReadsInput NA
TotalUsableReads 10507
AvgClearRange 11044
ContigReads 10408(99.06%)
BigContigReads 10408(99.06%)
SmallContigReads 0(0.00%)
DegenContigReads 0(0.00%)
SurrogateReads 0(0.00%)
PlacedSurrogateReads 0(0.00%)
SingletonReads 99(0.94%)
ChaffReads 99(0.94%)
Coverage
ContigsOnly 24.26
Contigs_Surrogates 24.26
Contigs_Degens_Surrogates 24.26
AllReads 24.50
TotalBaseCounts
BasesCount NA
ClearRangeLengthFRG NA
ClearRangeLengthASM 116037264
SurrogateBaseLength 0
ContigBaseLength 114934554
DegenBaseLength 0
SingletonBaseLength 1102710
Contig_SurrBaseLength 114934554
gcContent
Content 37.02
Unitig Consensus
NumColumnsInUnitigs 22739353
NumGapsInUnitigs 5208852
NumRunsOfGapsInUnitigReads 36163443