Yersinia pestis KIM D27, using 454 8 Kbp mated reads, with CA8.2

From wgs-assembler
Jump to: navigation, search


We will be assembling an 8kb library from Yersinia pestis KIM D27, SRP001358. This is 2 lanes, one full run, of a 454 GS FLX instrument, captured in experiment SRX012379.

Fetch

mkdir READS-sra
cd READS-sra

curl -O ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR029/SRR029367/SRR029367.sra
curl -O ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR029/SRR029368/SRR029368.sra

Convert

Using sff-dump from the SRA Toolkit, convert the NCBI .sra files into 454 .sff files.

sff-dump ./SRR029367.sra
sff-dump ./SRR029368.sra

Then convert the 454 .sff files into CA .frg files. This also detects the linker sequence and generates mate pairs from the individual reads.

sffToCA -insertsize 8000 800 -libraryname SRR029367 -trim chop -linker titanium -output SRR029367 SRR029367.sff
sffToCA -insertsize 8000 800 -libraryname SRR029368 -trim chop -linker titanium -output SRR029368 SRR029368.sff

sffToCA also outputs FASTQ formatted reads, which we are not going to use.

rm -f SRR029367.1.fastq SRR029367.2.fastq SRR029367.u.fastq
rm -f SRR029368.1.fastq SRR029368.2.fastq SRR029368.u.fastq

cd ..

Assemble, three different ways

We'll assemble the reads using each of the three unitigger modules.

No special settings are needed for this assembly. I tuned the computation to fit our 16gb 6-cpu workstation with the following spec file.

ovlHashBits         = 22          #  Default = 22; uses just over 4gb memory
ovlHashBlockLength  = 180000000   #  169,898,345 bases in the reads, this will fit all into one hash table
ovlRefBlockSize     = 1000000     #  861,036 reads, this will process all in one chunk
ovlThreads          = 6           #

cnsConcurrency      = 6           #  Run 6 single-threaded consensus jobs at a time
ovlConcurrency      = 1           #  Run 1 six-thread overlap job at a time

Run times are about 1.5 hours for bogart, 4.5 hours for bog, and 6.5 hours for utg. Times can be reduced by disabling computeInsertSize, increasing cgwMinMergeWeight and using cgwMergeFilterLevel=2, though the assembly might degrade.

runCA -p ypestis -d ypestis-utg -s ypestis.spec unitigger=utg    READS-sra/SRR029367.frg READS-sra/SRR029368.frg
runCA -p ypestis -d ypestis-bog -s ypestis.spec unitigger=bog    READS-sra/SRR029367.frg READS-sra/SRR029368.frg
runCA -p ypestis -d ypestis-bat -s ypestis.spec unitigger=bogart READS-sra/SRR029367.frg READS-sra/SRR029368.frg

Assemble, three different ways, efficiently

Instead of trimming and overlapping the same reads three times, we can run the assembler up to the unitig stage, then simply copy data to three different assemblies.

runCA -p ypestis -d ypestis -s ypestis.spec stopBefore=unitigger  READS-sra/SRR029367.frg READS-sra/SRR029368.frg

The next step in the assembly process will be unitig construction. The easist method to bifurcate into multiple assemblies is to copy the entire directory and resume runCA in the new directories.

cp -pr ypestis ypestis-utg-copy
cp -pr ypestis ypestis-bog-copy
cp -pr ypestis ypestis-bat-copy

Then just restart runCA. Since the reads are already loaded, we no longer need to supply their paths. These are all independent and can be run concurrently.

runCA -p ypestis -d ypestis-utg-copy -s ypestis.spec unitigger=utg
runCA -p ypestis -d ypestis-bog-copy -s ypestis.spec unitigger=bog
runCA -p ypestis -d ypestis-bat-copy -s ypestis.spec unitigger=bogart

CAVEAT: The default error rate allowed in overlaps used for trimming (obtErrorRate) is set based on the unitigger -- 1.5% for unitigger=utg, and 3.0% for unitigger=bog and unitigger=bat. This slightly changes the read trimming, which affects the assembly of unitigger=bogart. The genome is still in one scaffold, but there are many hundreds of additional small scaffolds.

Assemble, three different ways, very efficiently

Copying the assembly directory for a branch is very wasteful. Even this small assembly contains 2.4 gb of data. With a tiny bit of effort, we can copy just the data needed to finish the compute. The remaining steps of runCA need only the gkpStore and the ovlStore. The ovlStore is never modified, while only some of the smaller files in gkpStore are modified.

The steps below will create three new directories, link to the overlaps, and copy the minimal data from the gatekeeper store, linking to the rest. The link command will complain about the copied files existing.

mkdir ypestis-utg-link
cd ypestis-utg-link
ln -s ../ypestis/ypestis.ovlStore .
mkdir ypestis.gkpStore
cd ypestis.gkpStore
cp -p ../../ypestis/ypestis.gkpStore/inf .
cp -p ../../ypestis/ypestis.gkpStore/fnm .
cp -p ../../ypestis/ypestis.gkpStore/fpk .
cp -p ../../ypestis/ypestis.gkpStore/fsb .
cp -p ../../ypestis/ypestis.gkpStore/lib .
ln -s ../../ypestis/ypestis.gkpStore/*   .
cd ../..
mkdir ypestis-bog-link
cd ypestis-bog-link
ln -s ../ypestis/ypestis.ovlStore .
mkdir ypestis.gkpStore
cd ypestis.gkpStore
cp -p ../../ypestis/ypestis.gkpStore/inf .
cp -p ../../ypestis/ypestis.gkpStore/fnm .
cp -p ../../ypestis/ypestis.gkpStore/fpk .
cp -p ../../ypestis/ypestis.gkpStore/fsb .
cp -p ../../ypestis/ypestis.gkpStore/lib .
ln -s ../../ypestis/ypestis.gkpStore/*   .
cd ../..
mkdir ypestis-bat-link
cd ypestis-bat-link
ln -s ../ypestis/ypestis.ovlStore .
mkdir ypestis.gkpStore
cd ypestis.gkpStore
cp -p ../../ypestis/ypestis.gkpStore/inf .
cp -p ../../ypestis/ypestis.gkpStore/fnm .
cp -p ../../ypestis/ypestis.gkpStore/fpk .
cp -p ../../ypestis/ypestis.gkpStore/fsb .
cp -p ../../ypestis/ypestis.gkpStore/lib .
ln -s ../../ypestis/ypestis.gkpStore/*   .
cd ../..

When we restart the assembler, we need to disable Overlap Based Trimming and Overlap Error Correction. We could have also symlinked to the 0-, 1- and 3- directories.

runCA -p ypestis -d ypestis-utg-link -s ypestis.spec unitigger=utg    doOBT=0 doFragmentCorrection=0
runCA -p ypestis -d ypestis-bog-link -s ypestis.spec unitigger=bog    doOBT=0 doFragmentCorrection=0
runCA -p ypestis -d ypestis-bat-link -s ypestis.spec unitigger=bogart doOBT=0 doFragmentCorrection=0

Results

We chose y. pestis because it has some history, it is a difficult assembly, and because it has a reference we can compare to. We're aren't going to bother comparing against the plasmid.

QC statistics comparison

mergeqc.pl -wiki ypestis-utg/*qc ypestis-bog/*qc ypestis-bat/*qc ypestis-bat-copy/*qc 

The last column shows the "efficient" unitigger=bogart assembly (where obtErrorRate=0.015) the second to last column shows the "normal" unitigger=bogart assembly (where obtErrorRate=0.03). The last two columns, for unitigger=bog and unitigger=utg, do not change based on trimming.

Files
ypestis-utg/ypestis.qc ypestis-bog/ypestis.qc ypestis-bat/ypestis.qc ypestis-bat-copy/ypestis.qc
Scaffolds
TotalScaffolds 2648 28 10 318
TotalContigsInScaffolds 3150 112 84 394
MeanContigsPerScaffold 1.19 4.00 8.40 1.24
MinContigsPerScaffold 1 1 1 1
MaxContigsPerScaffold 333 55 48 47
TotalBasesInScaffolds 4807191 4660718 4634018 4676813
MeanBasesInScaffolds 1815 166454 463402 14707
MinBasesInScaffolds 64 232 72 64
MaxBasesInScaffolds 2853440 4474107 4460674 4476890
N25ScaffoldBases 2853440 4474107 4460674 4476890
N50ScaffoldBases 2853440 4474107 4460674 4476890
N75ScaffoldBases 886864 4474107 4460674 4476890
ScaffoldAt1000000 2853440 4474107 4460674 4476890
ScaffoldAt3000000 886864 N/A N/A N/A
ScaffoldAt4000000 574280 N/A N/A N/A
TotalSpanOfScaffolds 4971253 4704090 4679740 4718285
MeanSpanOfScaffolds 1877 168003 467974 14837
MinScaffoldSpan 64 232 72 64
MaxScaffoldSpan 2928559 4497307 4490992 4506147
IntraScaffoldGaps 502 84 74 76
2KbScaffolds 7 8 5 4
2KbScaffoldSpan 4623253 4692901 4674787 4677017
MeanSequenceGapLength 327 516 618 546
Top5Scaffolds=contigs,size,span,avgContig,avgGap,EUID
0 333,2853440,2928559,8569,226,76297 55,4474107,4497307,81347,430,7758 48,4460674,4490992,92931,645,3788 47,4476890,4506147,95253,636,4163
1 76,886864,911644,11669,330,73674 27,90645,106541,3357,611,7757 27,91003,106347,3370,590,3787 30,84082,96231,2803,419,4161
2 52,574280,594989,11044,406,76296 4,69191,69927,17298,245,7756 2,71208,71268,35604,60,3786 2,71208,71274,35604,66,4162
3 35,77303,107047,2209,875,73658 1,4337,4337,4337,0,7755 1,3448,3448,3448,0,3782 1,3365,3365,3365,0,4157
4 10,63902,70108,6390,690,73659 1,4198,4198,4198,0,7754 1,2732,2732,2732,0,3781 1,1818,1818,1818,0,4160
total 506,4455789,4612347,8806,312 88,4642478,4682310,52755,480 79,4629065,4674787,58596,618 81,4637363,4678835,57251,546
Contigs
TotalContigsInScaffolds 3150 112 84 394
TotalBasesInScaffolds 4807191 4660718 4634018 4676813
TotalVarRecords 6902 10426 11233 10651
MeanContigLength 1526 41614 55167 11870
MinContigLength 64 109 72 64
MaxContigLength 82682 362731 595392 591787
N25ContigBases 34987 296466 307008 387428
N50ContigBases 22399 159619 158305 179953
N75ContigBases 7057 79096 92636 93977
ContigAt1000000 40280 323707 307008 387428
ContigAt2000000 26940 181407 179958 223647
ContigAt3000000 14381 121311 121459 130053
ContigAt4000000 4463 49342 59627 59627
BigContigs_greater_10000
TotalBigContigs 126 44 40 39
BigContigLength 3313930 4506415 4515472 4532969
MeanBigContigLength 26301 102419 112887 116230
MinBigContig 10297 11285 10663 10111
MaxBigContig 82682 362731 595392 591787
BigContigsPercentBases 68.94 96.69 97.44 96.92
SmallContigs
TotalSmallContigs 3024 68 44 355
SmallContigLength 1493261 154303 118546 143844
MeanSmallContigLength 494 2269 2694 405
MinSmallContig 64 109 72 64
MaxSmallContig 9837 9434 8935 8668
SmallContigsPercentBases 31.06 3.31 2.56 3.08
DegenContigs
TotalDegenContigs 5015 640 203 216
DegenContigLength 1755131 195584 72562 67859
DegenVarRecords 317 94 105 65
MeanDegenContigLength 350 306 357 314
MinDegenContig 64 68 68 68
MaxDegenContig 1630 3797 6882 3023
DegenPercentBases 36.51 4.20 1.57 1.45
Top5Contigs=reads,bases,EUID
0 9944,82682,73573 54068,362731,7725 82831,595392,3759 81994,591787,4142
1 9176,74053,73648 50700,332773,7724 60542,402191,3766 60580,402193,4149
2 8292,65698,73232 47854,323707,7688 46905,307008,3742 58268,387428,4156
3 7414,59557,73652 40279,296466,7696 34968,236288,3765 46809,307005,4120
4 6614,59346,73645 34579,236017,7689 30548,199599,3777 34839,235743,4148
total 41440,341336 227480,1551694 255794,1740478 282490,1924156
UniqueUnitigs
TotalUUnitigs 57708 5064 2952 3000
MinUUnitigLength 64 64 64 64
MaxUUnitigLength 5930 22406 95116 94700
MeanUUnitigLength 212 1087 1740 1705
SDUUnitigLength 316 3295 7877 7704
Surrogates
TotalSurrogates 2765 1274 338 330
SurrogateInstances 2809 2954 1333 1181
SurrogateLength 1139541 439465 126914 131413
SurrogateInstanceLength 1171526 1281263 609081 557481
UnPlacedSurrReadLen 7432137 3764697 3129653 3540879
PlacedSurrReadLen 1892217 1536722 1224981 1250476
MinSurrogateLength 64 72 72 72
MaxSurrogateLength 2370 7381 6342 10897
MeanSurrogateLength 412 345 375 398
SDSurrogateLength 184 259 412 713
Mates
ReadsWithNoMate 351131(50.14%) 351131(50.14%) 350984(50.04%) 351131(50.14%)
ReadsWithGoodMate 254636(36.36%) 321500(45.91%) 324180(46.22%) 322972(46.12%)
ReadsWithBadShortMate 82(0.01%) 10(0.00%) 82(0.01%) 92(0.01%)
ReadsWithBadLongMate 648(0.09%) 3416(0.49%) 3600(0.51%) 3584(0.51%)
ReadsWithSameOrientMate 880(0.13%) 6538(0.93%) 6606(0.94%) 6692(0.96%)
ReadsWithOuttieMate 854(0.12%) 3840(0.55%) 3864(0.55%) 3974(0.57%)
ReadsWithBothChaffMate 27482(3.92%) 74(0.01%) 60(0.01%) 62(0.01%)
ReadsWithChaffMate 40730(5.82%) 1812(0.26%) 1366(0.19%) 1224(0.17%)
ReadsWithBothDegenMate 3620(0.52%) 420(0.06%) 256(0.04%) 98(0.01%)
ReadsWithDegenMate 11810(1.69%) 5016(0.72%) 3692(0.53%) 1828(0.26%)
ReadsWithBothSurrMate 0(0.00%) 220(0.03%) 296(0.04%) 428(0.06%)
ReadsWithSurrogateMate 166(0.02%) 2974(0.42%) 4672(0.67%) 6202(0.89%)
ReadsWithDiffScafMate 8286(1.18%) 3374(0.48%) 1712(0.24%) 2038(0.29%)
ReadsWithUnassignedMate 0(0.00%) 0(0.00%) 0(0.00%) 0(0.00%)
TotalScaffoldLinks 2 24 4 5
MeanScaffoldLinkWeight 15.00 51.04 11.75 8.80
Reads
TotalReadsInput NA NA NA NA
TotalUsableReads 700325 700325 701370 700325
AvgClearRange 195 195 197 195
ContigReads 509010(72.68%) 673519(96.17%) 678721(96.77%) 677785(96.78%)
BigContigReads 388927(55.54%) 659807(94.21%) 669003(95.39%) 668190(95.41%)
SmallContigReads 120083(17.15%) 13712(1.96%) 9718(1.39%) 9595(1.37%)
DegenContigReads 32476(4.64%) 4375(0.62%) 4532(0.65%) 2362(0.34%)
SurrogateReads 43624(6.23%) 27112(3.87%) 22030(3.14%) 24422(3.49%)
PlacedSurrogateReads 11096(1.58%) 9795(1.40%) 7686(1.10%) 7942(1.13%)
SingletonReads 126311(18.04%) 5114(0.73%) 3773(0.54%) 3698(0.53%)
ChaffReads 126289(18.03%) 5112(0.73%) 3771(0.54%) 3697(0.53%)
Coverage
ContigsOnly 20.58 28.10 28.79 28.19
Contigs_Surrogates 22.12 28.91 29.46 28.95
Contigs_Degens_Surrogates 17.38 27.93 29.21 28.65
AllReads 28.42 29.32 29.83 29.22
TotalBaseCounts
BasesCount NA NA NA NA
ClearRangeLengthFRG NA NA NA NA
ClearRangeLengthASM 136636593 136636414 138213787 136635927
SurrogateBaseLength 9324354 5301419 4354634 4791355
ContigBaseLength 98912067 130957121 133403933 131861981
DegenBaseLength 7685550 904614 945847 510290
SingletonBaseLength 22606839 1009982 734354 722777
Contig_SurrBaseLength 106344204 134721818 136533586 135402860
gcContent
Content 45.51 44.96 44.71 44.97
Unitig Consensus
NumColumnsInUnitigs 113226615 21442333 18426856 18228609
NumGapsInUnitigs 671827 798973 877351 805990
NumRunsOfGapsInUnitigReads 14352490 23593898 27967454 25716797
Contig Consensus
NumColumnsInUnitigs 6795020 5140447 5014933 5028264
NumGapsInUnitigs 232707 284159 308351 283630
NumRunsOfGapsInUnitigReads 5491065 8319692 9277263 8440606
NumColumnsInContigs 6788604 5130881 5003810 5018381
NumGapsInContigs 226281 274572 297226 273703
NumRunsOfGapsInContigReads 5205609 7824253 8667948 7927864
NumAAMismatches 7784 11543 12641 11919
NumVARStringsWithFlankingGaps 1657 2777 3072 2834

Dot plot comparison

The NCBI GenBank entry has everything you need to know, including the reference. The reference is also available directly courtesy of the University of Wisconsin - Madison E. coli Genome Project.

dotplot.sh ypestis-utg/CTG      AE009952.fas ypestis-utg/9-terminator/ypestis.ctg.fasta
dotplot.sh ypestis-utg/SCF      AE009952.fas ypestis-utg/9-terminator/ypestis.scf.fasta

dotplot.sh ypestis-bog/CTG      AE009952.fas ypestis-bog/9-terminator/ypestis.ctg.fasta
dotplot.sh ypestis-bog/SCF      AE009952.fas ypestis-bog/9-terminator/ypestis.scf.fasta

dotplot.sh ypestis-bat/CTG      AE009952.fas ypestis-bat/9-terminator/ypestis.ctg.fasta
dotplot.sh ypestis-bat/SCF      AE009952.fas ypestis-bat/9-terminator/ypestis.scf.fasta

dotplot.sh ypestis-bat-copy/CTG AE009952.fas ypestis-bat-copy/9-terminator/ypestis.ctg.fasta
dotplot.sh ypestis-bat-copy/SCF AE009952.fas ypestis-bat-copy/9-terminator/ypestis.scf.fasta

unitigger=utg scaffolds: Scaffolds from unitigger=utg

unitigger=bog scaffolds: Scaffolds from unitigger=bog

unitigger=bogart scaffolds: Scaffolds from unitigger=bogart

unitigger=bogart obtErrorRate=0.015 scaffolds: Scaffolds from unitigger=bogart, obtErrorRate=0.015