Escherichia coli K12 MG1655, using uncorrected PacBio reads, with CA8

From wgs-assembler
Jump to: navigation, search


CA 8 is able to assemble PacBio filtered_subreads directly, without correction.

Download and Prepare the reads

There are two datasets for Escherichia coli K12 MG1655.

Our friends at NBACC released the e.coli reads used in Koren et al. as NCBI SRA project SRP020003 contains reads generated using 454 FLX Titanium, PacBio RS C2 CCS and PacBio RS C2. The full data set is 600x coverage (11.4 GB), and the reads are (slightly) older technology, so we'll not use them in this example. If we were to use them, we could filter out all the short reads, or sample 100x randomly from all files, or just use one of the libraries.

Pacific Biosciences released 100x of P4-C2 chemistry reads. You can download them directly (6.35 GB) or from the original page. You must have the Pac Bio SMRTpipe software installed to extract the reads as FASTQ.

You can grab the three FASTQ files we made:

or use the following impressively long curl commands:

curl -L -o escherichia_coli_k12_mg1655.m130404_014004_sidney_c100506902550000001823076808221337_s1_p0.1.fastq.xz http://sourceforge.net/projects/wgs-assembler/files/wgs-assembler/wgs-8.0/datasets/escherichia_coli_k12_mg1655.m130404_014004_sidney_c100506902550000001823076808221337_s1_p0.1.fastq.xz/download
curl -L -o escherichia_coli_k12_mg1655.m130404_014004_sidney_c100506902550000001823076808221337_s1_p0.2.fastq.xz http://sourceforge.net/projects/wgs-assembler/files/wgs-assembler/wgs-8.0/datasets/escherichia_coli_k12_mg1655.m130404_014004_sidney_c100506902550000001823076808221337_s1_p0.2.fastq.xz/download
curl -L -o escherichia_coli_k12_mg1655.m130404_014004_sidney_c100506902550000001823076808221337_s1_p0.3.fastq.xz http://sourceforge.net/projects/wgs-assembler/files/wgs-assembler/wgs-8.0/datasets/escherichia_coli_k12_mg1655.m130404_014004_sidney_c100506902550000001823076808221337_s1_p0.3.fastq.xz/download

Once downloaded, a FRG file wrapper can be generated with fastqToCA without uncompressing the files.

fastqToCA \
  -libraryname ecoli \
  -technology pacbio-raw \
  -reads *1.fastq.xz \
  -reads *2.fastq.xz \
  -reads *3.fastq.xz \
> ecoli-untrimmed.frg

Overlap Based Trimming

Compile with AS_READ_MAX_NORMAL_LEN_BITS=15

Create a file ecoli-trim.spec with the following runCA configuration:

#useGrid               = 1
#scriptOnGrid          = 1
#
#sge                   = 
#sgeScript             = -pe thread 1 -l memory=8g
#sgeOverlap            = -pe thread 1 -l memory=2g
#sgeConsensus          = -pe thread 1 -l memory=6g

ovlConcurrency        = 4

merylMemory           = 0 -segments 8 -threads 8
merylThreads          = 8

merSize               = 17
merThreshold          = 
merDistinct           = 0.9995
merTotal              = 0.995

doOBT                 = 1
stopAfter             = obt

ovlErrorRate          = 0.35  #  Compute overlaps up to 40% error
obtErrorRate          = 0.35  #  Trim reads using overlaps up to 40% error.  New in CA 8!
cnsErrorRate          = 0.35  #  Needed to allow ovlErrorRate=0.40
cgwErrorRate          = 0.35  #  Needed to allow ovlErrorRate=0.40

ovlThreads            = 1
ovlHashBits           = 22
ovlHashBlockLength    = 10000000
ovlRefBlockSize       = 25000

saveOverlaps          = 1

This configuration assumes you have a machine with 16 processors, and at least 32gb memory. If you'd rather use SGE, uncomment the lines at the start. Then launch runCA to do the trimming. It will create 118 overlap jobs that require about 150 CPU hours.

runCA -p ecoli-trim -d ecoli-trim -s ecoli-trim.spec ecoli-untrimmed.frg

If gatekeeper fails and complains about 'seq longer than longer than gkpShortReadLength bases, truncating', then you did not compile for long read support.

Buried in the runCA stdout chatter is:

Supplied merDistinct 0.9995 with threshold 22 is the smallest.

Read Extraction

Once runCA completes, the trimmed reads are still stored in the gkpStore. A few commands will extract them, and replace the CA supplied UID with the original name.

WARNING: This only works if there is exactly one library. All reads are extracted into one set of FASTQ files.

% cd ecoli-trim

% gatekeeper -dumpfastq ecoli-trim ecoli-trim.gkpStore
Scanning store to find libraries used and reads to dump.
Added 0 reads to maintain mate relationships.
Dumping 0 fragments from unknown library (version 1 has these)
Dumping 91122 fragments from library IID 1

% ls -l ecoli-trim.*.fastq
-rw-r--r--  1 bri  staff          0 Oct 31 16:09 ecoli-trim.1.fastq
-rw-r--r--  1 bri  staff          0 Oct 31 16:09 ecoli-trim.2.fastq
-rw-r--r--  1 bri  staff          0 Oct 31 16:09 ecoli-trim.paired.fastq
-rw-r--r--  1 bri  staff  732898036 Oct 31 16:10 ecoli-trim.unmated.fastq

% rm -f ecoli-trim.1.fastq ecoli-trim.2.fastq ecoli-trim.paired.fastq

% replaceUIDwithName ecoli-trim.gkpStore.fastqUIDmap ecoli-trim.unmated.fastq 
Renaming 'ecoli-trim.unmated.fastq' to 'ecoli-trim.unmated.fastq.RENAMING'.
Renamed 91122 reads using 91394 names.

% ls -l ecoli-trim.*.fastq
-rw-r--r--  1 bri  staff  732898036 Oct 31 16:10 ecoli-trim.unmated.CA_UIDs.fastq
-rw-r--r--  1 bri  staff  734627768 Oct 31 16:10 ecoli-trim.unmated.fastq

% head -n 1 *fastq
==> ecoli-trim.unmated.CA_UIDs.fastq <==
@100000000001,1 clr=0,551 clv=1,0 max=1,0 tnt=1,0 rnd=t

==> ecoli-trim.unmated.fastq <==
@m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/25/0_593

% fastqToCA -libraryname ECOLI-TRIM -technology pacbio-raw -reads ecoli-trim.unmated.fastq > ecoli-trim.unmated.frg

Assembly

Create a file ecoli-assembly.spec with the following runCA configuration:

#useGrid               = 1
#scriptOnGrid          = 1
#
#sge                   = 
#sgeScript             = -pe thread 1 -l memory=8g
#sgeOverlap            = -pe thread 1 -l memory=2g
#sgeConsensus          = -pe thread 1 -l memory=6g

ovlConcurrency        = 4
cnsConcurrency        = 16

merylMemory           = 0 -segments 8 -threads 8
merylThreads          = 8

merSize               = 17
merThreshold          = 
merDistinct           = 0.9995
merTotal              = 0.995

doOBT                 = 0
doFragmentCorrection  = 0

ovlErrorRate          = 0.35  #  Compute overlaps up to 35% error
utgGraphErrorRate     = 0.35
utgMergeErrorRate     = 0.35
cnsErrorRate          = 0.35
cgwErrorRate          = 0.35

ovlMinLen             = 250   #  Only trust long overlaps

ovlThreads            = 1
ovlHashBits           = 22
ovlHashBlockLength    = 10000000
ovlRefBlockSize       = 25000

saveOverlaps          = 1

unitigger             = bogart

cnsReduceUnitigs      = 0 0   #  Always use only uncontained reads for consensus
cnsReuseUnitigs       = 1     #  With no mates, no need to redo consensus

cnsMinFrags           = 1000
cnsPartitions         = 256

and then launch the assembly.

runCA -p ecoli-assembly -d ecoli-assembly -s ecoli-assembly.spec ecoli-trim/ecoli-trim.unmated.frg

WARNING: Full consensus generation requires about 55 hours and at least 48gb memory.

Option cnsReduceUnitigs is used to sample the unitig down to the minimal set of reads that will form a unitig, 7x of coverage in this. Consensus accuracy will be degraded, however, this lets utgcns run in three hours using less than 8gb memory.

Results

Files
9-terminator/ecoli-assembly.qc
Scaffolds
TotalScaffolds 5
TotalContigsInScaffolds 5
MeanContigsPerScaffold 1.00
MinContigsPerScaffold 1
MaxContigsPerScaffold 1
TotalBasesInScaffolds 4729398
MeanBasesInScaffolds 945880
MinBasesInScaffolds 1011
MaxBasesInScaffolds 4721155
N25ScaffoldBases 4721155
N50ScaffoldBases 4721155
N75ScaffoldBases 4721155
ScaffoldAt1000000 4721155
TotalSpanOfScaffolds 4729398
MeanSpanOfScaffolds 945880
MinScaffoldSpan 1011
MaxScaffoldSpan 4721155
IntraScaffoldGaps 0
2KbScaffolds 2
2KbScaffoldSpan 4724879
MeanSequenceGapLength 0
Top5Scaffolds=contigs,size,span,avgContig,avgGap,EUID
0 1,4721155,4721155,4721155,0,100000091133
1 1,3724,3724,3724,0,100000091136
2 1,1928,1928,1928,0,100000091134
3 1,1580,1580,1580,0,100000091135
4 1,1011,1011,1011,0,100000091137
total 5,4729398,4729398,945880,0
Contigs
TotalContigsInScaffolds 5
TotalBasesInScaffolds 4729398
TotalVarRecords 0
MeanContigLength 945880
MinContigLength 1011
MaxContigLength 4721155
N25ContigBases 4721155
N50ContigBases 4721155
N75ContigBases 4721155
ContigAt1000000 4721155
BigContigs_greater_10000
TotalBigContigs 1
BigContigLength 4721155
MeanBigContigLength 4721155
MinBigContig 4721155
MaxBigContig 4721155
BigContigsPercentBases 99.83
SmallContigs
TotalSmallContigs 4
SmallContigLength 8243
MeanSmallContigLength 2061
MinSmallContig 1011
MaxSmallContig 3724
SmallContigsPercentBases 0.17
DegenContigs
TotalDegenContigs 0
DegenContigLength 0
DegenVarRecords 0
MeanDegenContigLength 0
MinDegenContig 0
MaxDegenContig 0
DegenPercentBases 0.00
Top5Contigs=reads,bases,EUID
0 87102,4721155,100000091128
1 2,3724,100000091132
2 5,1928,100000091129
3 2,1580,100000091130
4 11,1011,100000091131
total 87122,4729398
UniqueUnitigs
TotalUUnitigs 5
MinUUnitigLength 1321
MaxUUnitigLength 5910351
MeanUUnitigLength 1183816
SDUUnitigLength 2642214
Surrogates
TotalSurrogates 0
SurrogateInstances 0
SurrogateLength
SurrogateInstanceLength 0
UnPlacedSurrReadLen 0
PlacedSurrReadLen 0
MinSurrogateLength 0
MaxSurrogateLength 0
MeanSurrogateLength 0
SDSurrogateLength 0
Mates
ReadsWithNoMate 91122(100.00%)
ReadsWithGoodMate 0(0.00%)
ReadsWithBadShortMate 0(0.00%)
ReadsWithBadLongMate 0(0.00%)
ReadsWithSameOrientMate 0(0.00%)
ReadsWithOuttieMate 0(0.00%)
ReadsWithBothChaffMate 0(0.00%)
ReadsWithChaffMate 0(0.00%)
ReadsWithBothDegenMate 0(0.00%)
ReadsWithDegenMate 0(0.00%)
ReadsWithBothSurrMate 0(0.00%)
ReadsWithSurrogateMate 0(0.00%)
ReadsWithDiffScafMate 0(0.00%)
ReadsWithUnassignedMate 0(0.00%)
TotalScaffoldLinks 0
MeanScaffoldLinkWeight 0.00
Reads
TotalReadsInput NA
TotalUsableReads 91122
AvgClearRange 3989
ContigReads 87122(95.61%)
BigContigReads 87102(95.59%)
SmallContigReads 20(0.02%)
DegenContigReads 0(0.00%)
SurrogateReads 0(0.00%)
PlacedSurrogateReads 0(0.00%)
SingletonReads 4000(4.39%)
ChaffReads 4000(4.39%)
Coverage
ContigsOnly 75.82
Contigs_Surrogates 75.82
Contigs_Degens_Surrogates 75.82
AllReads 76.86
TotalBaseCounts
BasesCount NA
ClearRangeLengthFRG NA
ClearRangeLengthASM 363498140
SurrogateBaseLength 0
ContigBaseLength 358587763
DegenBaseLength 0
SingletonBaseLength 4910377
Contig_SurrBaseLength 358587763
gcContent
Content 40.43
Unitig Consensus
NumColumnsInUnitigs 14757314
NumGapsInUnitigs 1453
NumRunsOfGapsInUnitigReads 7565