Yersinia pestis KIM D27, using 454 8 Kbp mated reads, with CA8.1


We will assemble an 8 Kbp mated-read library from Yersinia pestis KIM D27 (SRP001358). The data are two lanes, one full run, of a 454 GS FLX instrument, captured in experiment SRX012379.

Fetch

mkdir READS-sra
cd READS-sra

curl -O ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR029/SRR029367/SRR029367.sra
curl -O ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR029/SRR029368/SRR029368.sra
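
The two download URLs above share a layout: the accession type, the first six characters of the accession, then the accession itself. As a minimal sketch, the helper below rebuilds the URL from a run accession. The function name `sra_url` is hypothetical, and the pattern is inferred only from these two URLs, so it may not hold for other accessions.

```shell
# Build the download URL for one SRA run accession, following the layout
# of the two URLs used above (an inferred pattern, not a documented one).
sra_url() {
    acc=$1
    kind=$(printf '%s' "$acc" | cut -c1-3)     # e.g. SRR
    prefix=$(printf '%s' "$acc" | cut -c1-6)   # e.g. SRR029
    printf 'ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/%s/%s/%s/%s.sra\n' \
        "$kind" "$prefix" "$acc" "$acc"
}

# Print the URL for each run in this experiment.
for run in SRR029367 SRR029368; do
    sra_url "$run"
done
```

Each printed URL can be handed to curl, for example `curl -O "$(sra_url SRR029367)"`.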

Convert

Using sff-dump from the SRA Toolkit, convert the NCBI .sra files into 454 .sff files.

sff-dump ./SRR029367.sra
sff-dump ./SRR029368.sra

Then convert the 454 .sff files into CA .frg files. This also detects the linker sequence and generates mate pairs from the individual reads.

sffToCA -insertsize 8000 800 -libraryname SRR029367 -trim chop -linker titanium -output SRR029367 SRR029367.sff
sffToCA -insertsize 8000 800 -libraryname SRR029368 -trim chop -linker titanium -output SRR029368 SRR029368.sff

sffToCA also writes FASTQ-formatted reads, which we are not going to use, so we remove them.

rm -f SRR029367.1.fastq SRR029367.2.fastq SRR029367.u.fastq
rm -f SRR029368.1.fastq SRR029368.2.fastq SRR029368.u.fastq

cd ..

Assemble, three different ways

We'll assemble the reads using each of the three unitigger modules.

No special settings are needed for this assembly. I tuned the computation to fit our 16 GB, 6-CPU workstation with the following spec file, saved as ypestis.spec.

ovlHashBits         = 22          #  Default = 22; uses just over 4gb memory
ovlHashBlockLength  = 180000000   #  169,898,345 bases in the reads, this will fit all into one hash table
ovlRefBlockSize     = 1000000     #  861,036 reads, this will process all in one chunk
ovlThreads          = 6           #

cnsConcurrency      = 6           #  Run 6 single-threaded consensus jobs at a time
ovlConcurrency      = 1           #  Run 1 six-thread overlap job at a time
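
The comments in the spec file claim that all bases fit in one hash table and all reads in one chunk. The sketch below checks that arithmetic, assuming the overlapper simply needs ceiling(total / block size) blocks. That is a simplification of how runCA partitions overlap jobs, and `blocks_needed` is a hypothetical helper, not part of the assembler.

```shell
# Ceiling division: how many blocks of size $2 are needed to hold $1 items.
blocks_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Numbers taken from the spec file comments above.
hash_blocks=$(blocks_needed 169898345 180000000)   # bases vs ovlHashBlockLength
ref_blocks=$(blocks_needed 861036 1000000)         # reads vs ovlRefBlockSize

echo "hash blocks: $hash_blocks, ref blocks: $ref_blocks"
```

Both values come out as 1, which is consistent with running a single six-thread overlap job (ovlConcurrency = 1).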

Run times are about one hour for bogart, 2.5 hours for bog, and 5 hours for utg.

runCA -p ypestis -d ypestis-utg -s ypestis.spec unitigger=utg    READS-sra/SRR029367.frg READS-sra/SRR029368.frg
runCA -p ypestis -d ypestis-bog -s ypestis.spec unitigger=bog    READS-sra/SRR029367.frg READS-sra/SRR029368.frg
runCA -p ypestis -d ypestis-bat -s ypestis.spec unitigger=bogart READS-sra/SRR029367.frg READS-sra/SRR029368.frg

Assemble, three different ways, efficiently

Instead of trimming and overlapping the same reads three times, we can run the assembler up to the unitig stage, then simply copy data to three different assemblies.

runCA -p ypestis -d ypestis -s ypestis.spec stopBefore=unitigger  READS-sra/SRR029367.frg READS-sra/SRR029368.frg

The next step in the assembly process is unitig construction. The easiest way to branch into multiple assemblies is to copy the entire directory and resume runCA in each new directory.

cp -pr ypestis ypestis-utg-copy
cp -pr ypestis ypestis-bog-copy
cp -pr ypestis ypestis-bat-copy

Then just restart runCA. Since the reads are already loaded, we no longer need to supply their paths. These are all independent and can be run concurrently.

runCA -p ypestis -d ypestis-utg-copy -s ypestis.spec unitigger=utg
runCA -p ypestis -d ypestis-bog-copy -s ypestis.spec unitigger=bog
runCA -p ypestis -d ypestis-bat-copy -s ypestis.spec unitigger=bogart

Assemble, three different ways, very efficiently

Copying the whole assembly directory for each branch is very wasteful. Even this small assembly contains 2.4 GB of data. With a tiny bit of effort, we can copy just the data needed to finish the compute. The remaining steps of runCA need only the gkpStore and the ovlStore. The ovlStore is never modified, and only some of the smaller files in the gkpStore are modified.

The steps below create three new directories, link to the overlaps, and copy the minimal data from the gatekeeper store, linking to the rest. The final ln command in each group will complain that the five copied files already exist; those errors are harmless, because ln -s does not overwrite existing files.

mkdir ypestis-utg
cd ypestis-utg
ln -s ../ypestis/ypestis.ovlStore .
mkdir ypestis.gkpStore
cd ypestis.gkpStore
cp -p ../../ypestis/ypestis.gkpStore/inf .
cp -p ../../ypestis/ypestis.gkpStore/fnm .
cp -p ../../ypestis/ypestis.gkpStore/fpk .
cp -p ../../ypestis/ypestis.gkpStore/fsb .
cp -p ../../ypestis/ypestis.gkpStore/lib .
ln -s ../../ypestis/ypestis.gkpStore/*   .
cd ../..
mkdir ypestis-bog
cd ypestis-bog
ln -s ../ypestis/ypestis.ovlStore .
mkdir ypestis.gkpStore
cd ypestis.gkpStore
cp -p ../../ypestis/ypestis.gkpStore/inf .
cp -p ../../ypestis/ypestis.gkpStore/fnm .
cp -p ../../ypestis/ypestis.gkpStore/fpk .
cp -p ../../ypestis/ypestis.gkpStore/fsb .
cp -p ../../ypestis/ypestis.gkpStore/lib .
ln -s ../../ypestis/ypestis.gkpStore/*   .
cd ../..
mkdir ypestis-bat
cd ypestis-bat
ln -s ../ypestis/ypestis.ovlStore .
mkdir ypestis.gkpStore
cd ypestis.gkpStore
cp -p ../../ypestis/ypestis.gkpStore/inf .
cp -p ../../ypestis/ypestis.gkpStore/fnm .
cp -p ../../ypestis/ypestis.gkpStore/fpk .
cp -p ../../ypestis/ypestis.gkpStore/fsb .
cp -p ../../ypestis/ypestis.gkpStore/lib .
ln -s ../../ypestis/ypestis.gkpStore/*   .
cd ../..
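
The three groups of commands above differ only in the directory name, so they can be folded into one function. This is a sketch under a few assumptions: `branch_assembly` is a hypothetical name, it links with absolute paths instead of the relative paths used above (so the links break if the base directory is moved), and it decides per file whether to copy or link, which avoids the "File exists" complaints.

```shell
# Create a lightweight branch of a base assembly directory.
#   $1 = base directory (e.g. ypestis)
#   $2 = assembly prefix (e.g. ypestis)
#   $3 = new branch directory (e.g. ypestis-utg)
branch_assembly() {
    base=$(cd "$1" && pwd) || return 1      # absolute path to the base assembly
    prefix=$2
    mkdir -p "$3/$prefix.gkpStore" || return 1

    # The overlap store is never modified, so a link is enough.
    ln -s "$base/$prefix.ovlStore" "$3/$prefix.ovlStore"

    for f in "$base/$prefix.gkpStore"/*; do
        name=$(basename "$f")
        case $name in
            # The five small files copied in the steps above.
            inf|fnm|fpk|fsb|lib) cp -p "$f" "$3/$prefix.gkpStore/$name" ;;
            # Everything else is only linked.
            *)                   ln -s "$f" "$3/$prefix.gkpStore/$name" ;;
        esac
    done
}
```

With this in place, `for d in ypestis-utg ypestis-bog ypestis-bat; do branch_assembly ypestis ypestis "$d"; done` would set up all three branches.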

When we restart the assembler, we need to disable Overlap Based Trimming and Overlap Error Correction, since both were already done in the original ypestis directory. Alternatively, we could have symlinked to the 0-, 1- and 3- directories there.

runCA -p ypestis -d ypestis-utg -s ypestis.spec unitigger=utg    doOBT=0 doFragmentCorrection=0
runCA -p ypestis -d ypestis-bog -s ypestis.spec unitigger=bog    doOBT=0 doFragmentCorrection=0
runCA -p ypestis -d ypestis-bat -s ypestis.spec unitigger=bogart doOBT=0 doFragmentCorrection=0

Results

We chose Y. pestis because it has some history, it is a difficult assembly, and it has a reference we can compare against. We aren't going to bother comparing against the plasmid.

QC statistics comparison

mergeqc.pl -wiki ypestis-*/*qc > qc.wiki
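
In the table below, each row lists three values in this column order: ypestis-bat (bogart), ypestis-bog (bog), ypestis-utg (utg).

Files
ypestis-bat/ypestis.qc ypestis-bog/ypestis.qc ypestis-utg/ypestis.qc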
Scaffolds
TotalScaffolds 7 28 2648
TotalContigsInScaffolds 86 112 3150
MeanContigsPerScaffold 12.29 4.00 1.19
MinContigsPerScaffold 1 1 1
MaxContigsPerScaffold 51 55 333
TotalBasesInScaffolds 4642830 4660718 4807191
MeanBasesInScaffolds 663261 166454 1815
MinBasesInScaffolds 1106 232 64
MaxBasesInScaffolds 4476780 4474107 2853440
N25ScaffoldBases 4476780 4474107 2853440
N50ScaffoldBases 4476780 4474107 2853440
N75ScaffoldBases 4476780 4474107 886864
ScaffoldAt1000000 4476780 4474107 2853440
TotalSpanOfScaffolds 4691040 4704090 4971253
MeanSpanOfScaffolds 670149 168003 1877
MinScaffoldSpan 1106 232 64
MaxScaffoldSpan 4507062 4497307 2928559
IntraScaffoldGaps 79 84 502
2KbScaffolds 4 8 7
2KbScaffoldSpan 4686699 4692901 4623253
MeanSequenceGapLength 610 516 327
Top5Scaffolds=contigs,size,span,avgContig,avgGap,EUID
0 51,4476780,4507062,87780,606,7180000003832 55,4474107,4497307,81347,430,7180000007757 333,2853440,2928559,8569,226,7180000076296
1 29,87136,104998,3005,638,7180000003831 27,90645,106541,3357,611,7180000007756 76,886864,911644,11669,330,7180000073673
2 2,71208,71274,35604,66,7180000003833 4,69191,69927,17298,245,7180000007755 52,574280,594989,11044,406,7180000076295
3 1,3365,3365,3365,0,7180000003827 1,4337,4337,4337,0,7180000007754 35,77303,107047,2209,875,7180000073657
4 1,1818,1818,1818,0,7180000003830 1,4198,4198,4198,0,7180000007753 10,63902,70108,6390,690,7180000073658
total 84,4640307,4688517,55242,610 88,4642478,4682310,52755,480 506,4455789,4612347,8806,312
Contigs
TotalContigsInScaffolds 86 112 3150
TotalBasesInScaffolds 4642830 4660718 4807191
TotalVarRecords 10677 10426 6902
MeanContigLength 53986 41614 1526
MinContigLength 78 109 64
MaxContigLength 591787 362731 82682
N25ContigBases 307005 296466 34987
N50ContigBases 179953 159619 22399
N75ContigBases 93977 79096 7057
ContigAt1000000 307005 323707 40280
ContigAt2000000 223396 181407 26940
ContigAt3000000 130053 121311 14381
ContigAt4000000 59627 49342 4463
BigContigs_greater_10000
TotalBigContigs 40 44 126
BigContigLength 4523969 4506415 3313930
MeanBigContigLength 113099 102419 26301
MinBigContig 10111 11285 10297
MaxBigContig 591787 362731 82682
BigContigsPercentBases 97.44 96.69 68.94
SmallContigs
TotalSmallContigs 46 68 3024
SmallContigLength 118861 154303 1493261
MeanSmallContigLength 2584 2269 494
MinSmallContig 78 109 64
MaxSmallContig 8668 9434 9837
SmallContigsPercentBases 2.56 3.31 31.06
DegenContigs
TotalDegenContigs 200 640 5015
DegenContigLength 64118 195584 1755131
DegenVarRecords 61 94 317
MeanDegenContigLength 321 306 350
MinDegenContig 68 68 64
MaxDegenContig 3023 3797 1630
DegenPercentBases 1.38 4.20 36.51
Top5Contigs=reads,bases,EUID
0 81994,591787,7180000003813 54068,362731,7180000007724 9944,82682,7180000073572
1 60580,402193,7180000003820 50700,332773,7180000007723 9176,74053,7180000073647
2 46809,307005,7180000003791 47854,323707,7180000007687 8292,65698,7180000073231
3 42419,281879,7180000003789 40279,296466,7180000007695 7414,59557,7180000073651
4 34839,235743,7180000003819 34579,236017,7180000007688 6614,59346,7180000073644
total 266641,1818607 227480,1551694 41440,341336
UniqueUnitigs
TotalUUnitigs 2998 5064 57708
MinUUnitigLength 64 64 64
MaxUUnitigLength 94700 22406 5930
MeanUUnitigLength 1706 1087 212
SDUUnitigLength 7707 3295 316
Surrogates
TotalSurrogates 343 1274 2765
SurrogateInstances 1293 2954 2809
SurrogateLength 132817 439069 1139541
SurrogateInstanceLength 598155 1281263 1171526
UnPlacedSurrReadLen 3516329 3764697 7432137
PlacedSurrReadLen 1286249 1536722 1892217
MinSurrogateLength 72 72 64
MaxSurrogateLength 10897 7381 2370
MeanSurrogateLength 387 345 412
SDSurrogateLength 682 257 184
Mates
ReadsWithNoMate 351131(50.14%) 351131(50.14%) 351131(50.14%)
ReadsWithGoodMate 324202(46.29%) 321500(45.91%) 254636(36.36%)
ReadsWithBadShortMate 82(0.01%) 10(0.00%) 82(0.01%)
ReadsWithBadLongMate 3590(0.51%) 3416(0.49%) 648(0.09%)
ReadsWithSameOrientMate 6692(0.96%) 6538(0.93%) 880(0.13%)
ReadsWithOuttieMate 3934(0.56%) 3840(0.55%) 854(0.12%)
ReadsWithBothChaffMate 62(0.01%) 74(0.01%) 27482(3.92%)
ReadsWithChaffMate 1234(0.18%) 1812(0.26%) 40730(5.82%)
ReadsWithBothDegenMate 88(0.01%) 420(0.06%) 3620(0.52%)
ReadsWithDegenMate 1766(0.25%) 5016(0.72%) 11810(1.69%)
ReadsWithBothSurrMate 422(0.06%) 220(0.03%) 0(0.00%)
ReadsWithSurrogateMate 5714(0.82%) 2974(0.42%) 166(0.02%)
ReadsWithDiffScafMate 1408(0.20%) 3374(0.48%) 8286(1.18%)
ReadsWithUnassignedMate 0(0.00%) 0(0.00%) 0(0.00%)
TotalScaffoldLinks 11 24 2
MeanScaffoldLinkWeight 33.00 51.04 15.00
Reads
TotalReadsInput NA NA NA
TotalUsableReads 700325 700325 700325
AvgClearRange 195 195 195
ContigReads 678116(96.83%) 673519(96.17%) 509010(72.68%)
BigContigReads 667386(95.30%) 659807(94.21%) 388927(55.54%)
SmallContigReads 10730(1.53%) 13712(1.96%) 120083(17.15%)
DegenContigReads 2182(0.31%) 4375(0.62%) 32476(4.64%)
SurrogateReads 24507(3.50%) 27112(3.87%) 43624(6.23%)
PlacedSurrogateReads 8183(1.17%) 9795(1.40%) 11096(1.58%)
SingletonReads 3703(0.53%) 5114(0.73%) 126311(18.04%)
ChaffReads 3702(0.53%) 5112(0.73%) 126289(18.03%)
Coverage
ContigsOnly 28.41 28.10 20.58
Contigs_Surrogates 29.17 28.91 22.12
Contigs_Degens_Surrogates 28.87 27.93 17.38
AllReads 29.43 29.32 28.42
TotalBaseCounts
BasesCount NA NA NA
ClearRangeLengthFRG NA NA NA
ClearRangeLengthASM 136636145 136636414 136636593
SurrogateBaseLength 4802578 5301419 9324354
ContigBaseLength 131915332 130957121 98912067
DegenBaseLength 479897 904614 7685550
SingletonBaseLength 724587 1009982 22606839
Contig_SurrBaseLength 135431661 134721818 106344204
gcContent
Content 44.94 44.96 45.51
Unitig Consensus
NumColumnsInUnitigs 18226322 21441149 113226615
NumGapsInUnitigs 803701 797791 671827
NumRunsOfGapsInUnitigReads 25308048 23414668 14352490
Contig Consensus
NumColumnsInUnitigs 4990880 5140449 6795020
NumGapsInUnitigs 283970 284161 232707
NumRunsOfGapsInUnitigReads 8451563 8319732 5491065
NumColumnsInContigs 4980981 5130881 6788604
NumGapsInContigs 274027 274572 226281
NumRunsOfGapsInContigReads 7937691 7824259 5205609
NumAAMismatches 11940 11543 7784
NumVARStringsWithFlankingGaps 2839 2777 1657
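
As an aid to reading the table, the BigContigsPercentBases row can be reproduced from two other rows: BigContigLength divided by TotalBasesInScaffolds. The helper `percent_of` below is a hypothetical name used only for this worked check; the numbers are copied from the table above.

```shell
# Percentage of $2 represented by $1, to two decimal places.
percent_of() {
    awk -v part="$1" -v whole="$2" 'BEGIN { printf "%.2f\n", 100 * part / whole }'
}

# BigContigLength and TotalBasesInScaffolds, in column order bogart, bog, utg.
percent_of 4523969 4642830
percent_of 4506415 4660718
percent_of 3313930 4807191
```

The three printed values are 97.44, 96.69 and 68.94, matching the BigContigsPercentBases row. The utg assembly keeps far fewer of its bases in contigs longer than 10,000 bases.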

Dot plot comparison

The NCBI GenBank entry has everything you need to know, including the reference. The reference is also available directly courtesy of the University of Wisconsin - Madison E. coli Genome Project.

dotplot.sh ypestis-utg/CTG AE009952.fas ypestis-utg/9-terminator/ypestis.ctg.fasta
dotplot.sh ypestis-utg/SCF AE009952.fas ypestis-utg/9-terminator/ypestis.scf.fasta

dotplot.sh ypestis-bog/CTG AE009952.fas ypestis-bog/9-terminator/ypestis.ctg.fasta
dotplot.sh ypestis-bog/SCF AE009952.fas ypestis-bog/9-terminator/ypestis.scf.fasta

dotplot.sh ypestis-bat/CTG AE009952.fas ypestis-bat/9-terminator/ypestis.ctg.fasta
dotplot.sh ypestis-bat/SCF AE009952.fas ypestis-bat/9-terminator/ypestis.scf.fasta

Figure: scaffolds from unitigger=utg

Figure: scaffolds from unitigger=bog

Figure: scaffolds from unitigger=bogart