Yersinia pestis KIM D27, using 454 8 Kbp mated reads, with CA8


We will assemble an 8 Kbp mated-read library from Yersinia pestis KIM D27 (SRP001358). The data is one full run (two lanes) of a 454 GS FLX instrument, captured in experiment SRX012379.

Fetch

mkdir READS-sra
cd READS-sra

curl -O ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR029/SRR029367/SRR029367.sra
curl -O ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR029/SRR029368/SRR029368.sra
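
The two URLs above share a layout: each run accession sits under a directory named for its first six characters. A minimal sketch, assuming that layout also holds for other SRR accessions (the helper name sra_url is ours and is not part of any NCBI tool), builds the URL from the accession:

```shell
# Hypothetical helper: build the download URL for one SRA run accession,
# following the layout of the two URLs above (SRR/<first 6 chars>/<accession>/).
sra_url() {
    acc=$1
    prefix=$(printf '%s' "$acc" | cut -c1-6)
    printf 'ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/%s/%s/%s.sra\n' \
        "$prefix" "$acc" "$acc"
}

# Print one URL per run; pass each to "curl -O" to download it.
for acc in SRR029367 SRR029368; do
    sra_url "$acc"
done
```

Keeping the accession list in one place makes it harder to fetch one run and forget the other.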

Convert

First, convert the NCBI .sra files into 454 .sff files.

sff-dump ./SRR029367.sra
sff-dump ./SRR029368.sra

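Then convert the 454 .sff files into CA .frg files. This step also detects the linker sequence and splits each read that contains it into a mate pair. The two -insertsize values give the expected mate distance (8000 bp) and its standard deviation (800 bp).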

sffToCA -insertsize 8000 800 -libraryname SRR029367 -trim chop -linker titanium -output SRR029367 SRR029367.sff
sffToCA -insertsize 8000 800 -libraryname SRR029368 -trim chop -linker titanium -output SRR029368 SRR029368.sff

sffToCA also writes FASTQ-formatted reads, which we will not use, so we delete them.

rm -f SRR029367.1.fastq SRR029367.2.fastq SRR029367.u.fastq
rm -f SRR029368.1.fastq SRR029368.2.fastq SRR029368.u.fastq

cd ..
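
The conversion steps above are identical for both runs, so they can be written as a loop. The following is a sketch only: the run wrapper and the DRY_RUN variable are our own additions, and DRY_RUN defaults to 1 so that the commands are printed rather than executed. Run it from inside READS-sra.

```shell
# Sketch: per-run conversion as a loop. With DRY_RUN=1 (the default here) the
# commands are only printed; set DRY_RUN=0 to execute them, which requires
# sff-dump and sffToCA on the PATH.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

convert_run() {
    acc=$1
    run sff-dump "./$acc.sra"
    run sffToCA -insertsize 8000 800 -libraryname "$acc" -trim chop -linker titanium -output "$acc" "$acc.sff"
    run rm -f "$acc.1.fastq" "$acc.2.fastq" "$acc.u.fastq"
}

for acc in SRR029367 SRR029368; do
    convert_run "$acc"
done
```

Printing first makes it easy to compare the generated commands against the ones listed above before committing to a long conversion.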

Assemble, three different ways

We'll assemble the reads using each of the three unitigger modules.

WARNING: The utg module takes a long time to generate scaffolds. Run times are about 1 hour for bogart, about 2.5 hours for bog, and about 13 hours for utg (the original unitigger).

runCA -p ypestis -d ypestis-utg -s ypestis.spec unitigger=utg    READS-sra/SRR029367.frg READS-sra/SRR029368.frg
runCA -p ypestis -d ypestis-bog -s ypestis.spec unitigger=bog    READS-sra/SRR029367.frg READS-sra/SRR029368.frg
runCA -p ypestis -d ypestis-bat -s ypestis.spec unitigger=bogart READS-sra/SRR029367.frg READS-sra/SRR029368.frg
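
The three commands above differ only in the unitigger name and the output directory, and the directory for bogart is named bat rather than bogart. As a sketch, the mapping can be made explicit with two small helpers (suffix_for and runca_cmd are hypothetical names of ours) that print the command lines instead of running them:

```shell
# Sketch: emit one runCA command line per unitigger module.
# The directory suffixes (utg, bog, bat) follow the commands above.
suffix_for() {
    case $1 in
        utg)    echo utg ;;
        bog)    echo bog ;;
        bogart) echo bat ;;
        *)      echo "unknown unitigger: $1" >&2; return 1 ;;
    esac
}

runca_cmd() {
    utg=$1
    sfx=$(suffix_for "$utg") || return 1
    echo "runCA -p ypestis -d ypestis-$sfx -s ypestis.spec unitigger=$utg READS-sra/SRR029367.frg READS-sra/SRR029368.frg"
}

for u in utg bog bogart; do
    runca_cmd "$u"
done
```

The printed lines can be pasted into a shell or a batch script, one per assembly.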

Assemble, three different ways, efficiently

Instead of trimming and overlapping the same reads three times, we can run the assembler up to the unitig stage, then simply copy data to three different assemblies.

runCA -p ypestis -d ypestis -s ypestis.spec stopBefore=unitigger  READS-sra/SRR029367.frg READS-sra/SRR029368.frg

The next step in the assembly process is unitig construction. The easiest way to branch into multiple assemblies is to copy the entire directory and resume runCA in each copy. Since the reads are already loaded, we can drop them from the command.

cp -pr ypestis ypestis-utg-copy
cp -pr ypestis ypestis-bog-copy
cp -pr ypestis ypestis-bat-copy

runCA -p ypestis -d ypestis-utg-copy -s ypestis.spec unitigger=utg
runCA -p ypestis -d ypestis-bog-copy -s ypestis.spec unitigger=bog
runCA -p ypestis -d ypestis-bat-copy -s ypestis.spec unitigger=bogart

Assemble, three different ways, very efficiently

However, copying the whole assembly directory for each branch is wasteful. Even this small assembly contains 2.4 GB of data.

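The remaining steps of runCA need only the gkpStore and the ovlStore. The ovlStore is never modified, so it can be symlinked. In the gkpStore only a few of the smaller files (inf, fnm, fpk, fsb) are modified, so we copy those and symlink the rest. Instead of symlinking the 0-, 1- and 3- directories, we disable the steps that use them with doOBT=0 and doFragmentCorrection=0.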

# Branch for unitigger=utg: symlink the ovlStore, copy the small gkpStore
# files that get modified, and symlink everything else in the gkpStore.
mkdir ypestis-utg
cd ypestis-utg
ln -s ../ypestis/ypestis.ovlStore .
mkdir ypestis.gkpStore
cd ypestis.gkpStore
cp -p ../../ypestis/ypestis.gkpStore/inf .
cp -p ../../ypestis/ypestis.gkpStore/fnm .
cp -p ../../ypestis/ypestis.gkpStore/fpk .
cp -p ../../ypestis/ypestis.gkpStore/fsb .
# ln reports "File exists" for the four files copied above; that is expected.
ln -s ../../ypestis/ypestis.gkpStore/*   .
cd ../..

# Same setup for unitigger=bog.
mkdir ypestis-bog
cd ypestis-bog
ln -s ../ypestis/ypestis.ovlStore .
mkdir ypestis.gkpStore
cd ypestis.gkpStore
cp -p ../../ypestis/ypestis.gkpStore/inf .
cp -p ../../ypestis/ypestis.gkpStore/fnm .
cp -p ../../ypestis/ypestis.gkpStore/fpk .
cp -p ../../ypestis/ypestis.gkpStore/fsb .
ln -s ../../ypestis/ypestis.gkpStore/*   .
cd ../..

# Same setup for unitigger=bogart.
mkdir ypestis-bat
cd ypestis-bat
ln -s ../ypestis/ypestis.ovlStore .
mkdir ypestis.gkpStore
cd ypestis.gkpStore
cp -p ../../ypestis/ypestis.gkpStore/inf .
cp -p ../../ypestis/ypestis.gkpStore/fnm .
cp -p ../../ypestis/ypestis.gkpStore/fpk .
cp -p ../../ypestis/ypestis.gkpStore/fsb .
ln -s ../../ypestis/ypestis.gkpStore/*   .
cd ../..

# Resume each branch with overlap-based trimming and fragment correction disabled.
runCA -p ypestis -d ypestis-utg -s ypestis.spec unitigger=utg    doOBT=0 doFragmentCorrection=0
runCA -p ypestis -d ypestis-bog -s ypestis.spec unitigger=bog    doOBT=0 doFragmentCorrection=0
runCA -p ypestis -d ypestis-bat -s ypestis.spec unitigger=bogart doOBT=0 doFragmentCorrection=0
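
The per-branch setup above repeats the same dozen commands three times. A minimal sketch of the same idea as a function follows; make_branch is a hypothetical name of ours. It differs from the commands above in two ways: it links with absolute paths instead of relative ones, and it skips the four copied files when linking, so ln does not report "File exists".

```shell
# Sketch: create a lightweight branch of a runCA assembly directory.
# Copies the four small gkpStore files that get modified (inf, fnm, fpk, fsb)
# and symlinks everything else, including the ovlStore.
# Usage: make_branch <source-dir> <new-dir> <prefix>
make_branch() {
    src=$(cd "$1" && pwd) || return 1   # absolute path, so the links resolve from anywhere
    dst=$2
    prefix=$3

    mkdir -p "$dst/$prefix.gkpStore" || return 1
    ln -s "$src/$prefix.ovlStore" "$dst/$prefix.ovlStore" || return 1

    for f in "$src/$prefix.gkpStore"/*; do
        [ -e "$f" ] || continue          # empty store: the glob did not match
        name=$(basename "$f")
        case $name in
            inf|fnm|fpk|fsb) cp -p "$f" "$dst/$prefix.gkpStore/$name" || return 1 ;;
            *)               ln -s "$f" "$dst/$prefix.gkpStore/$name" || return 1 ;;
        esac
    done
}
```

For example, make_branch ypestis ypestis-utg ypestis creates the utg branch; repeat with ypestis-bog and ypestis-bat, then resume runCA in each branch as above.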

Results

We chose Y. pestis because it has some history, it is a difficult assembly, and it has a reference we can compare against. We aren't going to bother comparing against the plasmid.

QC statistics comparison

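mergeqc.pl -wiki ypestis-*/*qc > qc.wiki

Each row below gives a statistic followed by its value for ypestis-bat (unitigger=bogart), ypestis-bog (unitigger=bog) and ypestis-utg (unitigger=utg), in that order. Section names (Files, Scaffolds, Contigs, and so on) appear on lines of their own.

Files
ypestis-bat/ypestis.qc ypestis-bog/ypestis.qc ypestis-utg/ypestis.qc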
Scaffolds
TotalScaffolds 9 23 3953
TotalContigsInScaffolds 108 104 4388
MeanContigsPerScaffold 12.00 4.52 1.11
MinContigsPerScaffold 1 1 1
MaxContigsPerScaffold 66 43 185
TotalBasesInScaffolds 4642025 4665331 5052893
MeanBasesInScaffolds 515781 202840 1278
MinBasesInScaffolds 600 182 64
MaxBasesInScaffolds 4474763 4490594 2030033
N25ScaffoldBases 4474763 4490594 2030033
N50ScaffoldBases 4474763 4490594 1333940
N75ScaffoldBases 4474763 4490594 1005382
ScaffoldAt1000000 4474763 4490594 2030033
TotalSpanOfScaffolds 4696701 4694844 5198974
MeanSpanOfScaffolds 521856 204124 1315
MinScaffoldSpan 600 182 64
MaxScaffoldSpan 4511842 4502224 2060627
IntraScaffoldGaps 99 81 435
2KbScaffolds 4 6 8
2KbScaffoldSpan 4690695 4687060 4667272
MeanSequenceGapLength 552 364 336
Top5Scaffolds=contigs,size,span,avgContig,avgGap,EUID
0 66,4474763,4511842,67799,570,3705 43,4490594,4502224,104432,277,6991 185,2030033,2060627,10973,166,69873
1 33,87663,104817,2656,536,3704 36,87476,104704,2430,492,6989 107,1333940,1364903,12467,292,71379
2 3,70557,71000,23519,222,3703 4,69462,70116,17366,218,6988 96,1005382,1033263,10473,293,69817
3 1,3036,3036,3036,0,3698 1,3956,3956,3956,0,6981 37,71747,105752,1939,945,69530
4 1,1870,1870,1870,0,3702 2,3445,3446,1722,1,6986 13,68934,78006,5303,756,69540
total 104,4637889,4692565,44595,552 86,4654933,4684446,54127,364 438,4510036,4642551,10297,306
Contigs
TotalContigsInScaffolds 108 104 4388
TotalBasesInScaffolds 4642025 4665331 5052893
TotalVarRecords 9596 8803 6234
MeanContigLength 42982 44859 1152
MinContigLength 150 77 64
MaxContigLength 413458 454299 159731
N25ContigBases 236289 402154 52397
N50ContigBases 141532 184360 25707
N75ContigBases 68261 94211 7957
ContigAt1000000 275741 402154 59857
ContigAt2000000 154585 195760 32713
ContigAt3000000 79892 121312 19233
ContigAt4000000 41539 71786 5516
BigContigs_greater_10000
TotalBigContigs 53 38 118
BigContigLength 4481933 4536115 3625570
MeanBigContigLength 84565 119371 30725
MinBigContig 10739 11491 10184
MaxBigContig 413458 454299 159731
BigContigsPercentBases 96.55 97.23 71.75
SmallContigs
TotalSmallContigs 55 66 4270
SmallContigLength 160092 129216 1427323
MeanSmallContigLength 2911 1958 334
MinSmallContig 150 77 64
MaxSmallContig 9913 9501 9139
SmallContigsPercentBases 3.45 2.77 28.25
DegenContigs
TotalDegenContigs 169 483 3883
DegenContigLength 65768 152357 1350007
DegenVarRecords 92 65 224
MeanDegenContigLength 389 315 348
MinDegenContig 64 64 64
MaxDegenContig 6938 3057 1630
DegenPercentBases 1.42 3.27 26.72
Top5Contigs=reads,bases,EUID
0 62974,413458,3666 61751,454299,6953 18603,159731,69521
1 53829,360635,3681 62335,411510,6929 14378,113140,69524
2 41749,275741,3668 59816,402154,6959 10500,88209,69338
3 34792,236289,3682 37086,255916,6937 9937,78101,69392
4 31537,211439,3648 35470,246191,6958 10207,76819,69464
total 224881,1497562 256458,1770070 63625,516000
UniqueUnitigs
TotalUUnitigs 2967 4814 55135
MinUUnitigLength 64 64 64
MaxUUnitigLength 94921 22261 7862
MeanUUnitigLength 1721 1129 218
SDUUnitigLength 7797 3429 359
Surrogates
TotalSurrogates 283 1084 2236
SurrogateInstances 1151 2743 2315
SurrogateLength 127971 394247 921388
SurrogateInstanceLength 565955 1289703 964117
UnPlacedSurrReadLen 3581577 3693412 5904831
PlacedSurrReadLen 1139787 1560764 1602193
MinSurrogateLength 74 77 74
MaxSurrogateLength 20986 7239 2048
MeanSurrogateLength 452 364 412
SDSurrogateLength 1297 263 183
Mates
ReadsWithNoMate 352104(50.43%) 352300(50.55%) 352300(50.55%)
ReadsWithGoodMate 318628(45.64%) 320220(45.94%) 265318(38.07%)
ReadsWithBadShortMate 82(0.01%) 12(0.00%) 60(0.01%)
ReadsWithBadLongMate 3594(0.51%) 3382(0.49%) 602(0.09%)
ReadsWithSameOrientMate 6568(0.94%) 6460(0.93%) 772(0.11%)
ReadsWithOuttieMate 3880(0.56%) 3812(0.55%) 748(0.11%)
ReadsWithBothChaffMate 60(0.01%) 80(0.01%) 21324(3.06%)
ReadsWithChaffMate 1152(0.16%) 1644(0.24%) 32118(4.61%)
ReadsWithBothDegenMate 230(0.03%) 136(0.02%) 2078(0.30%)
ReadsWithDegenMate 3680(0.53%) 2872(0.41%) 9334(1.34%)
ReadsWithBothSurrMate 1068(0.15%) 342(0.05%) 0(0.00%)
ReadsWithSurrogateMate 5578(0.80%) 3428(0.49%) 232(0.03%)
ReadsWithDiffScafMate 1566(0.22%) 2288(0.33%) 12090(1.73%)
ReadsWithUnassignedMate 0(0.00%) 0(0.00%) 0(0.00%)
TotalScaffoldLinks 20 17 0
MeanScaffoldLinkWeight 21.65 41.00 0.00
Reads
TotalReadsInput NA NA NA
TotalUsableReads 698190 696976 696976
AvgClearRange 197 195 195
ContigReads 673408(96.45%) 672106(96.43%) 540506(77.55%)
BigContigReads 658726(94.35%) 660926(94.83%) 443104(63.58%)
SmallContigReads 14682(2.10%) 11180(1.60%) 97402(13.97%)
DegenContigReads 4452(0.64%) 3204(0.46%) 23928(3.43%)
SurrogateReads 24017(3.44%) 26761(3.84%) 35289(5.06%)
PlacedSurrogateReads 7305(1.05%) 9904(1.42%) 9475(1.36%)
SingletonReads 3618(0.52%) 4809(0.69%) 106728(15.31%)
ChaffReads 3615(0.52%) 4807(0.69%) 106718(15.31%)
Coverage
ContigsOnly 28.46 27.92 20.74
Contigs_Surrogates 29.23 28.71 21.91
Contigs_Degens_Surrogates 29.02 27.94 18.18
AllReads 29.59 29.06 26.83
TotalBaseCounts
BasesCount NA NA NA
ClearRangeLengthFRG NA NA NA
ClearRangeLengthASM 137349023 135570134 135569552
SurrogateBaseLength 4721364 5254176 7507024
ContigBaseLength 132121986 130250055 104779555
DegenBaseLength 937103 675878 5699413
SingletonBaseLength 708357 950789 19185753
Contig_SurrBaseLength 135703563 133943467 110684386
gcContent
Content 44.81 45.08 45.51
Unitig Consensus
NumColumnsInUnitigs 18035026 20790865 100420111
NumGapsInUnitigs 830550 759246 650806
NumRunsOfGapsInUnitigReads 26499741 22469227 14626985
Contig Consensus
NumColumnsInUnitigs 5001692 5087407 6630010
NumGapsInUnitigs 293907 269722 227118
NumRunsOfGapsInUnitigReads 8807114 7850276 5579334
NumColumnsInContigs 4991543 5078784 6623848
NumGapsInContigs 283748 261093 220941
NumRunsOfGapsInContigReads 8251426 7407869 5300094
NumAAMismatches 10782 9745 6926
NumVARStringsWithFlankingGaps 2709 2366 1507
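
Some of the derived rows can be recomputed from the raw counts as a sanity check. A minimal sketch using awk follows, with the values copied from the table above in column order bat, bog, utg; the function name pct is ours. It recomputes BigContigsPercentBases as BigContigLength divided by TotalBasesInScaffolds.

```shell
# Recompute BigContigsPercentBases = 100 * BigContigLength / TotalBasesInScaffolds.
# The numbers are copied from the QC table above.
pct() {
    awk -v big="$1" -v total="$2" 'BEGIN { printf "%.2f\n", 100 * big / total }'
}

pct 4481933 4642025   # ypestis-bat, prints 96.55
pct 4536115 4665331   # ypestis-bog, prints 97.23
pct 3625570 5052893   # ypestis-utg, prints 71.75
```

The three results agree with the BigContigsPercentBases row.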

Dot plot comparison

The NCBI GenBank entry (accession AE009952, the file name used in the commands below) has everything you need to know, including the reference sequence. The reference is also available directly, courtesy of the University of Wisconsin-Madison E. coli Genome Project.

dotplot.sh ypestis-utg/CTG AE009952.fas ypestis-utg/9-terminator/ypestis.ctg.fasta
dotplot.sh ypestis-utg/SCF AE009952.fas ypestis-utg/9-terminator/ypestis.scf.fasta

dotplot.sh ypestis-bog/CTG AE009952.fas ypestis-bog/9-terminator/ypestis.ctg.fasta
dotplot.sh ypestis-bog/SCF AE009952.fas ypestis-bog/9-terminator/ypestis.scf.fasta

dotplot.sh ypestis-bat/CTG AE009952.fas ypestis-bat/9-terminator/ypestis.ctg.fasta
dotplot.sh ypestis-bat/SCF AE009952.fas ypestis-bat/9-terminator/ypestis.scf.fasta

Figure: dot plot of scaffolds from unitigger=utg

Figure: dot plot of scaffolds from unitigger=bog

Figure: dot plot of scaffolds from unitigger=bogart