Porphyromonas gingivalis W83, using 454 3 Kbp mated reads, with CA8

From wgs-assembler

We will assemble our favorite test case, Porphyromonas gingivalis, from two lanes of 454 3 Kbp mate pair reads. We'll discard the unmated reads.

Rather than write up the conversion, the converted reads are provided as xz-compressed .frg files. Download them with curl:

curl -L -o porphyromonas_gingivalis_w83.flx.3200bp.0900bp.E8YURXS01.mate.frg.xz http://sourceforge.net/projects/wgs-assembler/files/wgs-assembler/wgs-8.0/datasets/porphyromonas_gingivalis_w83.flx.3200bp.0900bp.E8YURXS01.mate.frg.xz
curl -L -o porphyromonas_gingivalis_w83.flx.3200bp.0900bp.E8YURXS02.mate.frg.xz http://sourceforge.net/projects/wgs-assembler/files/wgs-assembler/wgs-8.0/datasets/porphyromonas_gingivalis_w83.flx.3200bp.0900bp.E8YURXS02.mate.frg.xz
curl -L -o porphyromonas_gingivalis_w83.flx.3200bp.0900bp.FJRUAFO01.mate.frg.xz http://sourceforge.net/projects/wgs-assembler/files/wgs-assembler/wgs-8.0/datasets/porphyromonas_gingivalis_w83.flx.3200bp.0900bp.FJRUAFO01.mate.frg.xz
curl -L -o porphyromonas_gingivalis_w83.flx.3200bp.0900bp.FJRUAFO02.mate.frg.xz http://sourceforge.net/projects/wgs-assembler/files/wgs-assembler/wgs-8.0/datasets/porphyromonas_gingivalis_w83.flx.3200bp.0900bp.FJRUAFO02.mate.frg.xz
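The four downloads differ only in the run identifier, so they can also be expressed as a loop. The following is a minimal sketch, not part of the original write-up: the function names and the DRYRUN variable are hypothetical, and by default it only prints the curl commands instead of running them.

```shell
#!/bin/sh
# Base URL and file-name prefix copied from the curl commands above.
BASE=http://sourceforge.net/projects/wgs-assembler/files/wgs-assembler/wgs-8.0/datasets
PREFIX=porphyromonas_gingivalis_w83.flx.3200bp.0900bp

# Print the four read-file names, one per line.
frg_files() {
  for run in E8YURXS01 E8YURXS02 FJRUAFO01 FJRUAFO02; do
    echo "$PREFIX.$run.mate.frg.xz"
  done
}

# Fetch every file that is not already present and non-empty.
# DRYRUN (hypothetical) defaults to 1, which prints the commands only.
fetch_reads() {
  frg_files | while read -r f; do
    if [ -s "$f" ]; then
      echo "have $f"
    elif [ "${DRYRUN:-1}" = 1 ]; then
      echo "curl -L -o $f $BASE/$f"
    else
      curl -L -o "$f" "$BASE/$f"
    fi
  done
}

fetch_reads
```

Setting DRYRUN=0 before calling fetch_reads performs the actual downloads; files already present are skipped, so the script can be re-run after an interrupted transfer.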

The following script runs the assemblies. It first runs the 'test-orig' assembly through unitig construction, including unitig consensus, and then stops. That directory is then copied to four new assembly directories, in which some mate pair links are removed (the reads themselves are retained).

test-orig: contains all the mate links.
test-AB: contains only mate links from the A and B reads.
test-CD: contains only mate links from the C and D reads.
test-AB+CD: contains only mate links from the A and B reads, but the C and D mate links will be added in the middle of scaffolding.
test-CD+AB: contains only mate links from the C and D reads, but the A and B mate links will be added in the middle of scaffolding.
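As a compact restatement of the list above (the full script follows), this sketch maps each assembly directory to the libraries whose mate links are removed at the start. It assumes that library IIDs 1 and 2 hold the A and B reads and IIDs 3 and 4 hold the C and D reads, which is inferred from the gatekeeper edits in the script rather than stated outright. The function names are hypothetical.

```shell
#!/bin/sh
# Print the library IIDs whose mate links are removed for an assembly.
# Assumption: libraries 1,2 are the A,B reads and 3,4 are the C,D reads.
libs_to_unmate() {
  case "$1" in
    test-orig)          ;;               # keep every mate link
    test-AB|test-AB+CD) echo "3 4" ;;    # drop the C and D links
    test-CD|test-CD+AB) echo "1 2" ;;    # drop the A and B links
    *) echo "unknown assembly: $1" >&2; return 1 ;;
  esac
}

# Print one gatekeeper edit line per library to be unmated.
edit_lines() {
  for iid in $(libs_to_unmate "$1"); do
    echo "lib iid $iid allfragsunmated t"
  done
}

# Summarize all five assemblies.
for asm in test-orig test-AB test-CD test-AB+CD test-CD+AB; do
  echo "$asm: $(libs_to_unmate "$asm")"
done
```

The lines printed by edit_lines have the same form as the delete1 through delete4 files that the script passes to gatekeeper --edit.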
#!/bin/sh

#  Build initial unitigs, stop after consensus

runCA -p test -d test-orig stopAfter=utgcns \
  useGrid=0 scriptOnGrid=0 \
  unitigger=bogart \
  cnsConcurrency=12 \
  ovlConcurrency=4 \
  porphyromonas_gingivalis_w83.flx.3200bp.0900bp.*frg.xz

#  Copy to test directories

cp -prv test-orig test-AB
cp -prv test-orig test-CD
cp -prv test-orig test-AB+CD
cp -prv test-orig test-CD+AB

#  Delete mates, save orig mates

cp -p test-orig/test.gkpStore/fnm test-orig/test.gkpStore/fnm.orig

echo lib iid 1 allfragsunmated t > delete1
echo lib iid 2 allfragsunmated t > delete2
echo lib iid 3 allfragsunmated t > delete3
echo lib iid 4 allfragsunmated t > delete4

gatekeeper --edit delete1 test-CD/test.gkpStore ; gatekeeper --edit delete2 test-CD/test.gkpStore
gatekeeper --edit delete3 test-AB/test.gkpStore ; gatekeeper --edit delete4 test-AB/test.gkpStore

gatekeeper --edit delete1 test-CD+AB/test.gkpStore ; gatekeeper --edit delete2 test-CD+AB/test.gkpStore
gatekeeper --edit delete3 test-AB+CD/test.gkpStore ; gatekeeper --edit delete4 test-AB+CD/test.gkpStore

#  Finish assemblies

runCA -p test -d test-orig useGrid=1 scriptOnGrid=1

runCA -p test -d test-AB useGrid=1 scriptOnGrid=1
runCA -p test -d test-CD useGrid=1 scriptOnGrid=1

#  Run first CGW, restore mates, finish assembly

runCA -p test -d test-AB+CD stopBefore=ECR useGrid=0 scriptOnGrid=0
runCA -p test -d test-CD+AB stopBefore=ECR useGrid=0 scriptOnGrid=0

cp -fp test-orig/test.gkpStore/fnm.orig test-CD+AB/test.gkpStore/fnm
cp -fp test-orig/test.gkpStore/fnm.orig test-AB+CD/test.gkpStore/fnm

runCA -p test -d test-AB+CD cgwReloadMates=1 useGrid=1 scriptOnGrid=1
runCA -p test -d test-CD+AB cgwReloadMates=1 useGrid=1 scriptOnGrid=1
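Because the final runCA calls are submitted to the grid, the script returns before the assemblies are complete. A rough way to see which ones have finished is to look for the scaffold FASTA that the dot plot commands in the next section read. This is only a sketch: treating that file as a completion marker is an assumption, and the function names are hypothetical.

```shell
#!/bin/sh
# True if the assembly directory holds a non-empty scaffold FASTA.
# Using this file as a completion marker is an assumption.
finished() {
  [ -s "$1/9-terminator/test.scf.fasta" ]
}

# Print "done" or "pending" for each directory given.
report() {
  for d in "$@"; do
    if finished "$d"; then
      echo "$d done"
    else
      echo "$d pending"
    fi
  done
}

report test-orig test-AB test-CD test-AB+CD test-CD+AB
```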

After these finish, we can make some dot plots to visualize the results.

dotplot.sh CD.CTG    AE015924.fasta test-CD/9-terminator/test.ctg.fasta
dotplot.sh AB.CTG    AE015924.fasta test-AB/9-terminator/test.ctg.fasta
dotplot.sh CD+AB.CTG AE015924.fasta test-CD+AB/9-terminator/test.ctg.fasta
dotplot.sh AB+CD.CTG AE015924.fasta test-AB+CD/9-terminator/test.ctg.fasta
dotplot.sh ABCD.CTG  AE015924.fasta test-orig/9-terminator/test.ctg.fasta

dotplot.sh CD.SCF    AE015924.fasta test-CD/9-terminator/test.scf.fasta
dotplot.sh AB.SCF    AE015924.fasta test-AB/9-terminator/test.scf.fasta
dotplot.sh CD+AB.SCF AE015924.fasta test-CD+AB/9-terminator/test.scf.fasta
dotplot.sh AB+CD.SCF AE015924.fasta test-AB+CD/9-terminator/test.scf.fasta
dotplot.sh ABCD.SCF  AE015924.fasta test-orig/9-terminator/test.scf.fasta
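The ten commands above follow one pattern, so they can be generated rather than typed. The sketch below only prints the commands; dotplot.sh itself is the script used on this page and is not reproduced here. The helper names asm_dir and plot_commands are hypothetical.

```shell
#!/bin/sh
REF=AE015924.fasta

# Map a plot label to its assembly directory; ABCD is the full-mate run.
asm_dir() {
  case "$1" in
    ABCD) echo test-orig ;;
    *)    echo "test-$1" ;;
  esac
}

# Print one dotplot.sh command per sequence kind (ctg, scf) and label.
plot_commands() {
  for kind in ctg scf; do
    suffix=$(echo "$kind" | tr 'a-z' 'A-Z')
    for label in CD AB CD+AB AB+CD ABCD; do
      echo "dotplot.sh $label.$suffix $REF $(asm_dir "$label")/9-terminator/test.$kind.fasta"
    done
  done
}

plot_commands
```

Piping the output to sh (plot_commands | sh) would run the plots, provided dotplot.sh is on the PATH.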


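Figures (dot plots, images not included):
Scaffolds without A,B mate links (test-CD)
Scaffolds without C,D mate links (test-AB)
Scaffolds with A,B mate links added later (test-CD+AB)
Scaffolds with C,D mate links added later (test-AB+CD)
Scaffolds with all mate links (test-orig)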

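We can also compare the QC reports. Each row below lists five values, one per assembly, in the order given under Files: test-AB+CD, test-AB, test-CD+AB, test-CD, test-orig.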

Files
test-AB+CD/test.qc test-AB/test.qc test-CD+AB/test.qc test-CD/test.qc test-orig/test.qc
Scaffolds
TotalScaffolds 18 17 13 13 16
TotalContigsInScaffolds 511 503 495 499 502
MeanContigsPerScaffold 28.39 29.59 38.08 38.38 31.38
MinContigsPerScaffold 1 1 1 1 1
MaxContigsPerScaffold 111 109 182 201 201
TotalBasesInScaffolds 2139478 2133332 2143819 2129017 2109421
MeanBasesInScaffolds 118860 125490 164909 163771 131839
MinBasesInScaffolds 250 250 113 517 132
MaxBasesInScaffolds 385563 386701 619430 670491 661492
N25ScaffoldBases 340119 341104 619430 670491 661492
N50ScaffoldBases 292186 278059 300450 297574 294354
N75ScaffoldBases 175225 172698 282606 280888 219649
ScaffoldAt1000000 292420 291408 300450 334922 294354
ScaffoldAt2000000 163806 164514 59451 168303 164228
TotalSpanOfScaffolds 2240781 2232463 2232179 2231662 2220781
MeanSpanOfScaffolds 124488 131321 171706 171666 138799
MinScaffoldSpan 250 250 113 517 132
MaxScaffoldSpan 409223 410851 659740 720706 710870
IntraScaffoldGaps 493 486 482 486 486
2KbScaffolds 9 10 9 9 7
2KbScaffoldSpan 2231059 2225087 2229279 2227865 2212837
MeanSequenceGapLength 205 204 183 211 229
Top5Scaffolds=contigs,size,span,avgContig,avgGap,EUID
0 98,385563,409223,3934,244,2772 99,386701,410851,3906,246,2795 182,619430,659740,3403,223,2743 201,670491,720706,3336,251,2790 201,661492,710870,3291,247,2797
1 111,340119,367099,3064,245,2771 109,341104,365221,3129,223,2796 61,337290,349369,5529,201,2742 59,334922,347662,5677,220,2791 64,308579,325936,4822,276,2795
2 87,292420,311311,3361,220,2768 84,291408,310183,3469,226,2794 78,300450,313209,3852,166,2738 80,297574,313675,3720,204,2787 75,294354,310999,3925,225,2793
3 45,292186,298053,6493,133,2766 47,278059,284826,5916,147,2797 49,285723,294607,5831,185,2744 48,285491,294483,5948,191,2786 46,284057,289068,6175,111,2796
4 48,283616,291637,5909,171,2773 34,257158,262300,7563,156,2792 45,282606,285617,6280,68,2740 46,280888,285033,6106,92,2789 40,219649,230269,5491,272,2794
total 389,1593904,1677323,4097,217 373,1554430,1633381,4167,215 415,1825499,1902542,4399,188 434,1869366,1961559,4307,215 426,1768131,1867142,4151,235
Contigs
TotalContigsInScaffolds 511 503 495 499 502
TotalBasesInScaffolds 2139478 2133332 2143819 2129017 2109421
TotalVarRecords 687 657 683 648 659
MeanContigLength 4187 4241 4331 4267 4202
MinContigLength 83 122 113 122 68
MaxContigLength 36338 36339 36338 36339 36338
N25ContigBases 12678 12368 12805 12759 12443
N50ContigBases 7620 7371 6940 7041 6699
N75ContigBases 3755 3751 3953 3792 3591
ContigAt1000000 8074 7873 7769 7769 7344
ContigAt2000000 1655 1693 1756 1665 1564
BigContigs_greater_10000
TotalBigContigs 55 53 52 50 50
BigContigLength 808791 775358 801251 774218 746927
MeanBigContigLength 14705 14629 15409 15484 14939
MinBigContig 10062 10062 10062 10062 10062
MaxBigContig 36338 36339 36338 36339 36338
BigContigsPercentBases 37.80 36.34 37.37 36.37 35.41
SmallContigs
TotalSmallContigs 456 450 443 449 452
SmallContigLength 1330687 1357974 1342568 1354799 1362494
MeanSmallContigLength 2918 3018 3031 3017 3014
MinSmallContig 83 122 113 122 68
MaxSmallContig 9873 9873 9728 9908 9864
SmallContigsPercentBases 62.20 63.66 62.63 63.63 64.59
DegenContigs
TotalDegenContigs 392 432 383 431 421
DegenContigLength 173590 178722 166797 181286 194472
DegenVarRecords 85 87 55 69 74
MeanDegenContigLength 443 414 436 421 462
MinDegenContig 64 65 64 64 64
MaxDegenContig 4101 4101 998 998 997
DegenPercentBases 8.11 8.38 7.78 8.52 9.22
Top5Contigs=reads,bases,EUID
0 6417,36338,2600 6365,36339,2645 6417,36338,2721 6371,36339,2629 6417,36338,2672
1 5313,33223,2537 5313,33223,2565 5313,33223,2430 5313,33223,2482 5313,33223,2535
2 4146,27587,2614 4058,27587,2657 4761,32646,2644 4708,32646,2761 4761,32646,2583
3 3461,24314,2708 3388,25036,2708 4153,27587,2654 4037,27210,2770 4106,27206,2648
4 3454,23458,2651 3449,23458,2631 3486,26681,2585 3414,26603,2643 3469,26657,2606
total 22791,144920 22573,145643 24130,156475 23843,156021 24066,156070
UniqueUnitigs
TotalUUnitigs 1142 1116 1115 1104 1032
MinUUnitigLength 65 68 65 65 65
MaxUUnitigLength 16668 16668 16668 16668 16668
MeanUUnitigLength 1767 1805 1789 1807 1881
SDUUnitigLength 1836 1840 1853 1854 1897
Surrogates
TotalSurrogates 319 298 356 313 408
SurrogateInstances 539 496 596 516 653
SurrogateLength 130770 128181 161184 145647 186546
SurrogateInstanceLength 232257 221192 265296 240159 280311
UnPlacedSurrReadLen 1019147 1926852 1198710 2040267 1391364
PlacedSurrReadLen 1529256 596971 1841867 837702 1938646
MinSurrogateLength 64 64 64 64 64
MaxSurrogateLength 6013 6013 6013 6013 6013
MeanSurrogateLength 410 430 453 465 457
SDSurrogateLength 433 442 462 479 433
Mates
ReadsWithNoMate 10225(3.33%) 174585(56.80%) 10225(3.33%) 142991(46.52%) 10225(3.33%)
ReadsWithGoodMate 225550(73.39%) 100914(32.83%) 227022(73.86%) 124420(40.48%) 218880(71.21%)
ReadsWithBadShortMate 0(0.00%) 0(0.00%) 0(0.00%) 0(0.00%) 0(0.00%)
ReadsWithBadLongMate 784(0.26%) 306(0.10%) 934(0.30%) 546(0.18%) 906(0.29%)
ReadsWithSameOrientMate 1500(0.49%) 576(0.19%) 1968(0.64%) 1072(0.35%) 1966(0.64%)
ReadsWithOuttieMate 924(0.30%) 372(0.12%) 1134(0.37%) 710(0.23%) 1114(0.36%)
ReadsWithBothChaffMate 0(0.00%) 0(0.00%) 0(0.00%) 0(0.00%) 0(0.00%)
ReadsWithChaffMate 124(0.04%) 44(0.01%) 122(0.04%) 64(0.02%) 108(0.04%)
ReadsWithBothDegenMate 3750(1.22%) 1998(0.65%) 3096(1.01%) 2596(0.84%) 4862(1.58%)
ReadsWithDegenMate 31956(10.40%) 15012(4.88%) 30406(9.89%) 18876(6.14%) 36110(11.75%)
ReadsWithBothSurrMate 2254(0.73%) 970(0.32%) 2542(0.83%) 1302(0.42%) 2380(0.77%)
ReadsWithSurrogateMate 7868(2.56%) 3590(1.17%) 9384(3.05%) 4648(1.51%) 10064(3.27%)
ReadsWithDiffScafMate 22416(7.29%) 8984(2.92%) 20518(6.68%) 10126(3.29%) 20736(6.75%)
ReadsWithUnassignedMate 0(0.00%) 0(0.00%) 0(0.00%) 0(0.00%) 0(0.00%)
TotalScaffoldLinks 595 226 559 308 680
MeanScaffoldLinkWeight 3.96 4.51 4.01 4.16 5.08
Reads
TotalReadsInput NA NA NA NA NA
TotalUsableReads 307351 307351 307351 307351 307351
AvgClearRange 113 113 113 113 113
ContigReads 279085(90.80%) 270661(88.06%) 279160(90.83%) 270272(87.94%) 274296(89.25%)
BigContigReads 113137(36.81%) 107252(34.90%) 112928(36.74%) 107245(34.89%) 104577(34.03%)
SmallContigReads 165948(53.99%) 163409(53.17%) 166232(54.09%) 163027(53.04%) 169719(55.22%)
DegenContigReads 19359(6.30%) 19654(6.39%) 17701(5.76%) 19007(6.18%) 20833(6.78%)
SurrogateReads 22584(7.35%) 22366(7.28%) 26989(8.78%) 25493(8.29%) 29674(9.65%)
PlacedSurrogateReads 13743(4.47%) 5403(1.76%) 16564(5.39%) 7492(2.44%) 17510(5.70%)
SingletonReads 66(0.02%) 73(0.02%) 65(0.02%) 71(0.02%) 58(0.02%)
ChaffReads 66(0.02%) 73(0.02%) 65(0.02%) 71(0.02%) 58(0.02%)
Coverage
ContigsOnly 14.71 14.31 14.68 14.32 14.67
Contigs_Surrogates 15.19 15.22 15.24 15.28 15.33
Contigs_Degens_Surrogates 14.97 14.97 14.98 14.99 15.03
AllReads 16.18 16.23 16.15 16.26 16.41
TotalBaseCounts
BasesCount NA NA NA NA NA
ClearRangeLengthFRG NA NA NA NA NA
ClearRangeLengthASM 34626664 34626500 34626241 34626258 34626128
SurrogateBaseLength 2548403 2523823 3040577 2877969 3330010
ContigBaseLength 31471748 30535828 31478009 30494153 30935975
DegenBaseLength 2129908 2157277 1943933 2085644 2293709
SingletonBaseLength 5861 6543 5589 6194 5080
Contig_SurrBaseLength 32490895 32462680 32676719 32534420 32327339
gcContent
Content 47.82 47.82 47.82 47.81 47.79
Unitig Consensus
NumColumnsInUnitigs 6978417 6978417 6978417 6978417 6978417
NumGapsInUnitigs 43666 43666 43666 43666 43666
NumRunsOfGapsInUnitigReads 860999 860999 860999 860999 860999
Contig Consensus
NumColumnsInUnitigs 2327867 2326403 2325145 2324365 2318479
NumGapsInUnitigs 14799 14349 14532 14063 14586
NumRunsOfGapsInUnitigReads 252803 243663 248622 239288 247412
NumColumnsInContigs 2327785 2326327 2325064 2324289 2318399
NumGapsInContigs 14716 14272 14447 13986 14506
NumRunsOfGapsInContigReads 249371 240449 245249 236096 243905
NumAAMismatches 1139 1093 779 752 1077
NumVARStringsWithFlankingGaps 65 63 57 59 71
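A side-by-side row like those in the comparison above can be pulled from the individual test.qc files. The sketch below assumes each QC file stores its metrics as KEY=VALUE lines, which should be checked against an actual file before relying on it. The function name qc_row is hypothetical.

```shell
#!/bin/sh
# Print one row: the key, then its value from each QC file in order.
# Assumption: metrics are stored as KEY=VALUE lines; a key that is
# missing from a file is reported as NA.
qc_row() {
  key=$1; shift
  row=$key
  for f in "$@"; do
    val=$(awk -F= -v k="$key" '$1 == k { print $2; exit }' "$f")
    row="$row ${val:-NA}"
  done
  echo "$row"
}

# Example call, using the column order of the comparison above:
# qc_row N50ScaffoldBases test-AB+CD/test.qc test-AB/test.qc \
#   test-CD+AB/test.qc test-CD/test.qc test-orig/test.qc
```

Only the first match in each file is used, so a key that appears in more than one section (for example TotalContigsInScaffolds) is reported from the section that comes first.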