Escherichia coli K12 MG1655, using uncorrected PacBio reads, with CA8.1

From wgs-assembler
Jump to: navigation, search


CA 8 is able to assemble PacBio filtered_subreads directly, without correction.

Download and Prepare the reads

There are two datasets for Escherichia coli K12 MG1655.

Our friends at NBACC released the e.coli reads used in Koren et al. as NCBI SRA project SRP020003 contains reads generated using 454 FLX Titanium, PacBio RS C2 CCS and PacBio RS C2. The full data set is 600x coverage (11.4 GB), and the reads are (slightly) older technology, so we'll not use them in this example. If we were to use them, we could filter out all the short reads, or sample 100x randomly from all files, or just use one of the libraries.

Pacific Biosciences released 100x of P4-C2 chemistry reads. You can download them directly (6.35 GB) or from the original page. You must have the Pac Bio SMRTpipe software installed to extract the reads as FASTQ.

You can grab the three FASTQ files we made:

or use the following impressively long curl commands:

curl -L -o escherichia_coli_k12_mg1655.m130404_014004_sidney_c100506902550000001823076808221337_s1_p0.1.fastq.xz http://sourceforge.net/projects/wgs-assembler/files/wgs-assembler/wgs-8.0/datasets/escherichia_coli_k12_mg1655.m130404_014004_sidney_c100506902550000001823076808221337_s1_p0.1.fastq.xz/download
curl -L -o escherichia_coli_k12_mg1655.m130404_014004_sidney_c100506902550000001823076808221337_s1_p0.2.fastq.xz http://sourceforge.net/projects/wgs-assembler/files/wgs-assembler/wgs-8.0/datasets/escherichia_coli_k12_mg1655.m130404_014004_sidney_c100506902550000001823076808221337_s1_p0.2.fastq.xz/download
curl -L -o escherichia_coli_k12_mg1655.m130404_014004_sidney_c100506902550000001823076808221337_s1_p0.3.fastq.xz http://sourceforge.net/projects/wgs-assembler/files/wgs-assembler/wgs-8.0/datasets/escherichia_coli_k12_mg1655.m130404_014004_sidney_c100506902550000001823076808221337_s1_p0.3.fastq.xz/download

Once downloaded, a FRG file wrapper can be generated with fastqToCA without uncompressing the files.

fastqToCA \
  -libraryname ecoli \
  -technology pacbio-raw \
  -reads *1.fastq.xz \
  -reads *2.fastq.xz \
  -reads *3.fastq.xz \
> ecoli-untrimmed.frg

Overlap Based Trimming

Compile with AS_READ_MAX_NORMAL_LEN_BITS=15

Create a file ecoli-trim.spec with the following runCA configuration:

merSize               = 17
merThreshold          = 0
merDistinct           = 0.9995
merTotal              = 0.995

doOBT                 = 1
stopAfter             = obt

ovlErrorRate          = 0.40  #  Compute overlaps up to 40% error
obtErrorRate          = 0.40  #  Trim reads using overlaps up to 40% error.  New in CA 8!
cnsErrorRate          = 0.40  #  Needed to allow ovlErrorRate=0.40
cgwErrorRate          = 0.40  #  Needed to allow ovlErrorRate=0.40

ovlConcurrency        = 16
cnsConcurrency        = 4

ovlThreads            = 1
ovlHashBits           = 22
ovlHashBlockLength    = 10000000
ovlRefBlockSize       = 25000

This configuration assumes you have a machine with 16 processors, and at least 32gb memory. RunCA will create 118 overlap jobs that require about 1gb memory each, and a total of 175 to 250 CPU hours.

runCA -p ecoli-trim -d ecoli-trim -s ecoli-trim.spec ecoli-untrimmed.frg

Since the merDistinct and merTotal are new to CA8, the runCA stdout chatter reports the actual merThreshold used:

Supplied merDistinct 0.9995 with threshold 22 is the smallest.

Read Extraction

Trimming is finished if the file 'ecoli-trim/0-overlaptrim/overlaptrim.success' exists. The trimmed reads are still stored in the gkpStore. A few commands will extract them, and replace the CA supplied UID with the original name.

Once the reads are extracted, you could filter out the short reads. Because we still have about 85x coverage, we'll sample down to the longest 25x using fastq-extractLongestCoverage.pl.

WARNING: This only works if there is exactly one library. All reads are extracted into one set of FASTQ files.

% cd ecoli-trim

% gatekeeper -dumpfastq ecoli-trim ecoli-trim.gkpStore
Scanning store to find libraries used and reads to dump.
Added 0 reads to maintain mate relationships.
Dumping 0 fragments from unknown library (version 1 has these)
Dumping 91388 fragments from library IID 1

% ls -l ecoli-trim.*.fastq
-rw-rw-r-- 1 bri staff         0 Dec 16 10:36 ecoli-trim.1.fastq
-rw-rw-r-- 1 bri staff         0 Dec 16 10:36 ecoli-trim.2.fastq
-rw-rw-r-- 1 bri staff         0 Dec 16 10:36 ecoli-trim.paired.fastq
-rw-rw-r-- 1 bri staff 775031926 Dec 16 10:37 ecoli-trim.unmated.fastq

% rm -f ecoli-trim.1.fastq ecoli-trim.2.fastq ecoli-trim.paired.fastq

% replaceUIDwithName ecoli-trim.gkpStore.fastqUIDmap ecoli-trim.unmated.fastq 
Renaming 'ecoli-trim.unmated.fastq' to 'ecoli-trim.unmated.fastq.RENAMING'.
Renamed 91388 reads using 91394 names.

% ls -l ecoli-trim.*.fastq
-rw-rw-r-- 1 bri staff  775031926 Dec 16 10:37 ecoli-trim.unmated.CA_UIDs.fastq
-rw-rw-r-- 1 bri staff  776764080 Dec 16 10:39 ecoli-trim.unmated.fastq

% head -n 1 *fastq
==> ecoli-trim.unmated.CA_UIDs.fastq <==
@100000000001,1 clr=0,551 clv=1,0 max=1,0 tnt=1,0 rnd=t

==> ecoli-trim.unmated.fastq <==
@m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/25/0_593

% fastq-extractLongestCoverage.pl 4639675 25 ecoli-trim.unmated.fastq > ecoli-trim.unmated.longest25x.fastq
sumLength = 115995795
minLength = 8615 (with 10507 reads)
tgtLength = 115991875

% fastqToCA -libraryname ecoli-trimmed-25x -technology pacbio-raw -reads ecoli-trim.unmated.longest25x.fastq > ecoli-trim.unmated.longest25x.frg

Assembly

Create a file ecoli-assembly.spec with the following runCA configuration:

merSize               = 17
merThreshold          = 0
merDistinct           = 0.9995
merTotal              = 0.995

doOBT                 = 0
doExtendClearRanges   = 0

unitigger             = bogart

ovlErrorRate          = 0.35  #  Compute overlaps up to 35% error
utgGraphErrorRate     = 0.35  #  Unitigs at 35% error
utgMergeErrorRate     = 0.35  #  Unitigs at 35% error
cnsErrorRate          = 0.35  #  Needed to allow ovlErrorRate=0.35
cgwErrorRate          = 0.35  #  Needed to allow ovlErrorRate=0.35

ovlConcurrency        = 16
cnsConcurrency        = 16

ovlThreads            = 1
ovlHashBits           = 22
ovlHashBlockLength    = 10000000
ovlRefBlockSize       = 25000

#cnsReduceUnitigs      = 0 0   #  Always use only uncontained reads for consensus
cnsReuseUnitigs       = 1     #  With no mates, no need to redo consensus

cnsMinFrags           = 1000
cnsPartitions         = 256

and then launch the assembly.

runCA -p ecoli-assembly -d ecoli-assembly -s ecoli-assembly.spec ecoli-trim/ecoli-trim.unmated.longest25x.frg

By selecting only the longest 25x of reads, only 12 overlap jobs are created. As during trimming, only 1gb memory is needed for each job. The overlaps need only 25 to 31 CPU hours.

Unitig consensus requires 16gb memory and 11 CPU hours.

Results

% cd ecoli-assembly
% dotplot.sh CTG NC_000913.fasta 9-terminator/ecoli-assembly.ctg.fasta 
% mergeqc.pl -wiki *qc > qc.wiki

Contigs

Files
ecoli-assembly/ecoli-assembly.qc
Scaffolds
TotalScaffolds 1
TotalContigsInScaffolds 1
MeanContigsPerScaffold 1.00
MinContigsPerScaffold 1
MaxContigsPerScaffold 1
TotalBasesInScaffolds 4736059
MeanBasesInScaffolds 4736059
MinBasesInScaffolds 4736059
MaxBasesInScaffolds 4736059
N25ScaffoldBases 4736059
N50ScaffoldBases 4736059
N75ScaffoldBases 4736059
ScaffoldAt1000000 4736059
TotalSpanOfScaffolds 4736059
MeanSpanOfScaffolds 4736059
MinScaffoldSpan 4736059
MaxScaffoldSpan 4736059
IntraScaffoldGaps 0
2KbScaffolds 1
2KbScaffoldSpan 4736059
MeanSequenceGapLength 0
Top5Scaffolds=contigs,size,span,avgContig,avgGap,EUID
0 1,4736059,4736059,4736059,0,7180000000002
total 1,4736059,4736059,4736059,0
Contigs
TotalContigsInScaffolds 1
TotalBasesInScaffolds 4736059
TotalVarRecords 0
MeanContigLength 4736059
MinContigLength 4736059
MaxContigLength 4736059
N25ContigBases 4736059
N50ContigBases 4736059
N75ContigBases 4736059
ContigAt1000000 4736059
BigContigs_greater_10000
TotalBigContigs 1
BigContigLength 4736059
MeanBigContigLength 4736059
MinBigContig 4736059
MaxBigContig 4736059
BigContigsPercentBases 100.00
SmallContigs
TotalSmallContigs 0
SmallContigLength 0
MeanSmallContigLength 0
MinSmallContig 0
MaxSmallContig 0
SmallContigsPercentBases 0.00
DegenContigs
TotalDegenContigs 0
DegenContigLength 0
DegenVarRecords 0
MeanDegenContigLength 0
MinDegenContig 0
MaxDegenContig 0
DegenPercentBases 0.00
Top5Contigs=reads,bases,EUID
0 10435,4736059,7180000000001
total 10435,4736059
UniqueUnitigs
TotalUUnitigs 1
MinUUnitigLength 7532692
MaxUUnitigLength 7532692
MeanUUnitigLength 7532692
SDUUnitigLength 0
Surrogates
TotalSurrogates 0
SurrogateInstances 0
SurrogateLength
SurrogateInstanceLength 0
UnPlacedSurrReadLen 0
PlacedSurrReadLen 0
MinSurrogateLength 0
MaxSurrogateLength 0
MeanSurrogateLength 0
SDSurrogateLength 0
Mates
ReadsWithNoMate 10507(100.00%)
ReadsWithGoodMate 0(0.00%)
ReadsWithBadShortMate 0(0.00%)
ReadsWithBadLongMate 0(0.00%)
ReadsWithSameOrientMate 0(0.00%)
ReadsWithOuttieMate 0(0.00%)
ReadsWithBothChaffMate 0(0.00%)
ReadsWithChaffMate 0(0.00%)
ReadsWithBothDegenMate 0(0.00%)
ReadsWithDegenMate 0(0.00%)
ReadsWithBothSurrMate 0(0.00%)
ReadsWithSurrogateMate 0(0.00%)
ReadsWithDiffScafMate 0(0.00%)
ReadsWithUnassignedMate 0(0.00%)
TotalScaffoldLinks 0
MeanScaffoldLinkWeight 0.00
Reads
TotalReadsInput NA
TotalUsableReads 10507
AvgClearRange 11040
ContigReads 10435(99.31%)
BigContigReads 10435(99.31%)
SmallContigReads 0(0.00%)
DegenContigReads 0(0.00%)
SurrogateReads 0(0.00%)
PlacedSurrogateReads 0(0.00%)
SingletonReads 72(0.69%)
ChaffReads 72(0.69%)
Coverage
ContigsOnly 24.32
Contigs_Surrogates 24.32
Contigs_Degens_Surrogates 24.32
AllReads 24.49
TotalBaseCounts
BasesCount NA
ClearRangeLengthFRG NA
ClearRangeLengthASM 115995795
SurrogateBaseLength 0
ContigBaseLength 115200711
DegenBaseLength 0
SingletonBaseLength 795084
Contig_SurrBaseLength 115200711
gcContent
Content 31.85
Unitig Consensus
NumColumnsInUnitigs 24920511
NumGapsInUnitigs 8302126
NumRunsOfGapsInUnitigReads 125555342