Porphyromonas gingivalis W83, using 454 3 Kbp mated reads, with CA8.2


We will assemble our favorite test case, Porphyromonas gingivalis, from two lanes of 454 3 Kbp mate pair reads. We'll discard the unmated reads.

Rather than write up the conversion, here are the converted reads, as xz-compressed FRG files. They can be fetched with curl:

curl -L -o porphyromonas_gingivalis_w83.flx.3200bp.0900bp.E8YURXS01.mate.frg.xz http://sourceforge.net/projects/wgs-assembler/files/wgs-assembler/wgs-8.0/datasets/porphyromonas_gingivalis_w83.flx.3200bp.0900bp.E8YURXS01.mate.frg.xz
curl -L -o porphyromonas_gingivalis_w83.flx.3200bp.0900bp.E8YURXS02.mate.frg.xz http://sourceforge.net/projects/wgs-assembler/files/wgs-assembler/wgs-8.0/datasets/porphyromonas_gingivalis_w83.flx.3200bp.0900bp.E8YURXS02.mate.frg.xz
curl -L -o porphyromonas_gingivalis_w83.flx.3200bp.0900bp.FJRUAFO01.mate.frg.xz http://sourceforge.net/projects/wgs-assembler/files/wgs-assembler/wgs-8.0/datasets/porphyromonas_gingivalis_w83.flx.3200bp.0900bp.FJRUAFO01.mate.frg.xz
curl -L -o porphyromonas_gingivalis_w83.flx.3200bp.0900bp.FJRUAFO02.mate.frg.xz http://sourceforge.net/projects/wgs-assembler/files/wgs-assembler/wgs-8.0/datasets/porphyromonas_gingivalis_w83.flx.3200bp.0900bp.FJRUAFO02.mate.frg.xz
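The four commands differ only in the run identifier, so the download can also be written as a loop. This is a minimal sketch assuming a POSIX shell; the helper names `frg_names` and `fetch_reads` are invented for illustration, while the URL and file names are copied from the commands above.

```shell
#!/bin/sh
# Base URL and file-name prefix copied from the curl commands above.
base=http://sourceforge.net/projects/wgs-assembler/files/wgs-assembler/wgs-8.0/datasets
prefix=porphyromonas_gingivalis_w83.flx.3200bp.0900bp

# Print the four read file names, one per line (hypothetical helper).
frg_names() {
  for run in E8YURXS01 E8YURXS02 FJRUAFO01 FJRUAFO02; do
    echo "$prefix.$run.mate.frg.xz"
  done
}

# Download every file that is not already present and non-empty
# (hypothetical helper); stop at the first failed download.
fetch_reads() {
  for f in $(frg_names); do
    [ -s "$f" ] || curl -L -o "$f" "$base/$f" || return 1
  done
}
```

Calling `fetch_reads` performs the downloads. Running `xz -t` on each file afterwards checks that the archives are intact.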

The following script runs the assemblies. It first runs the 'test-orig' assembly through unitig construction, including unitig consensus, then stops. That directory is then copied to four new assembly directories, in which we remove some of the mate pair links (the reads themselves are retained).

test-orig
    contains all the mate links.
test-AB
    contains only mate links from the A and B reads (libraries 1 and 2 in the script below).
test-CD
    contains only mate links from the C and D reads (libraries 3 and 4 in the script below).
test-AB+CD
    contains only mate links from the A and B reads, but the C and D mate links will be added in the middle of scaffolding.
test-CD+AB
    contains only mate links from the C and D reads, but the A and B mate links will be added in the middle of scaffolding.
#!/bin/sh

#  Build initial unitigs, stop after consensus

runCA -p test -d test-orig stopAfter=utgcns \
  useGrid=0 scriptOnGrid=0 \
  unitigger=bogart \
  cnsConcurrency=12 \
  ovlConcurrency=4 \
  porphyromonas_gingivalis_w83.flx.3200bp.0900bp.*frg.xz

#  Copy to test directories

cp -prv test-orig test-AB
cp -prv test-orig test-CD
cp -prv test-orig test-AB+CD
cp -prv test-orig test-CD+AB

#  Delete mates, save orig mates

cp -p test-orig/test.gkpStore/fnm test-orig/test.gkpStore/fnm.orig

echo lib iid 1 allfragsunmated t > delete1
echo lib iid 2 allfragsunmated t > delete2
echo lib iid 3 allfragsunmated t > delete3
echo lib iid 4 allfragsunmated t > delete4

gatekeeper --edit delete1 test-CD/test.gkpStore ; gatekeeper --edit delete2 test-CD/test.gkpStore
gatekeeper --edit delete3 test-AB/test.gkpStore ; gatekeeper --edit delete4 test-AB/test.gkpStore

gatekeeper --edit delete1 test-CD+AB/test.gkpStore ; gatekeeper --edit delete2 test-CD+AB/test.gkpStore
gatekeeper --edit delete3 test-AB+CD/test.gkpStore ; gatekeeper --edit delete4 test-AB+CD/test.gkpStore

#  Finish assemblies

runCA -p test -d test-orig useGrid=1 scriptOnGrid=1

runCA -p test -d test-AB useGrid=1 scriptOnGrid=1
runCA -p test -d test-CD useGrid=1 scriptOnGrid=1

#  Run first CGW, restore mates, finish assembly

runCA -p test -d test-AB+CD stopBefore=ECR useGrid=0 scriptOnGrid=0
runCA -p test -d test-CD+AB stopBefore=ECR useGrid=0 scriptOnGrid=0

cp -fp test-orig/test.gkpStore/fnm.orig test-CD+AB/test.gkpStore/fnm
cp -fp test-orig/test.gkpStore/fnm.orig test-AB+CD/test.gkpStore/fnm

runCA -p test -d test-AB+CD cgwReloadMates=1 useGrid=1 scriptOnGrid=1
runCA -p test -d test-CD+AB cgwReloadMates=1 useGrid=1 scriptOnGrid=1
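Before plotting, it can help to confirm that every assembly produced its final output. This is a minimal sketch, assuming a finished run leaves `9-terminator/test.ctg.fasta` and `9-terminator/test.scf.fasta` in its directory (the paths the dot plot commands read). The helper name `check_assemblies` is hypothetical.

```shell
#!/bin/sh
# Report any assembly directory that lacks its final contig or scaffold FASTA.
# check_assemblies is a hypothetical helper; its optional argument is the
# directory holding the five test-* directories (default: current directory).
check_assemblies() {
  root=${1:-.}
  missing=0
  for d in test-orig test-AB test-CD test-AB+CD test-CD+AB; do
    for kind in ctg scf; do
      f="$root/$d/9-terminator/test.$kind.fasta"
      if [ ! -s "$f" ]; then
        echo "missing: $f"
        missing=1
      fi
    done
  done
  # Exit status is 0 only when every expected file exists and is non-empty.
  return $missing
}
```

`check_assemblies` prints one line per missing file and returns non-zero if anything is absent, so it can gate the plotting step.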

After these finish, we can make dot plots of each assembly's contigs and scaffolds against the reference sequence, AE015924.fasta, to visualize the results.

dotplot.sh CD.CTG    AE015924.fasta test-CD/9-terminator/test.ctg.fasta
dotplot.sh AB.CTG    AE015924.fasta test-AB/9-terminator/test.ctg.fasta
dotplot.sh CD+AB.CTG AE015924.fasta test-CD+AB/9-terminator/test.ctg.fasta
dotplot.sh AB+CD.CTG AE015924.fasta test-AB+CD/9-terminator/test.ctg.fasta
dotplot.sh ABCD.CTG  AE015924.fasta test-orig/9-terminator/test.ctg.fasta

dotplot.sh CD.SCF    AE015924.fasta test-CD/9-terminator/test.scf.fasta
dotplot.sh AB.SCF    AE015924.fasta test-AB/9-terminator/test.scf.fasta
dotplot.sh CD+AB.SCF AE015924.fasta test-CD+AB/9-terminator/test.scf.fasta
dotplot.sh AB+CD.SCF AE015924.fasta test-AB+CD/9-terminator/test.scf.fasta
dotplot.sh ABCD.SCF  AE015924.fasta test-orig/9-terminator/test.scf.fasta
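The ten commands follow one pattern, so they can also be generated. This sketch assumes the same `dotplot.sh` argument order as above (label, reference, assembly). The helper `dotplot_cmds` is hypothetical and only prints the commands.

```shell
#!/bin/sh
# Print one dotplot.sh command per (assembly, sequence type) pair.
# Each entry is label:directory; labels and paths match the commands above.
dotplot_cmds() {
  for kind in ctg scf; do
    # CTG or SCF, used as the label suffix.
    suffix=$(echo "$kind" | tr 'a-z' 'A-Z')
    for pair in CD:test-CD AB:test-AB CD+AB:test-CD+AB AB+CD:test-AB+CD ABCD:test-orig; do
      label=${pair%%:*}
      dir=${pair#*:}
      echo "dotplot.sh $label.$suffix AE015924.fasta $dir/9-terminator/test.$kind.fasta"
    done
  done
}
```

Reviewing the output of `dotplot_cmds` first, then piping it to `sh`, runs the same plots as the explicit list.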


Figures (dot plots, in order): scaffolds without A,B mate links; scaffolds without C,D mate links; scaffolds with A,B mate links added later; scaffolds with C,D mate links added later; scaffolds with all mate links.

We can also compare the QC reports. The five value columns are in the order of the files listed under Files.

Files
test-AB+CD/test.qc test-AB/test.qc test-CD+AB/test.qc test-CD/test.qc test-orig/test.qc
Scaffolds
TotalScaffolds 14 85 23 23 29
TotalContigsInScaffolds 433 501 428 433 441
MeanContigsPerScaffold 30.93 5.89 18.61 18.83 15.21
MinContigsPerScaffold 1 1 1 1 1
MaxContigsPerScaffold 92 95 139 140 135
TotalBasesInScaffolds 2199902 2204929 2200060 2187851 2160642
MeanBasesInScaffolds 157136 25940 95655 95124 74505
MinBasesInScaffolds 250 64 69 66 65
MaxBasesInScaffolds 358943 399071 548136 539203 530386
N25ScaffoldBases 343859 355721 340816 335577 316853
N50ScaffoldBases 293230 337740 308097 308497 304762
N75ScaffoldBases 240361 283998 282750 251670 218676
ScaffoldAt1000000 340052 344903 308097 308497 304762
ScaffoldAt2000000 180074 175613 156283 156263 147318
TotalSpanOfScaffolds 2270053 2269569 2269376 2260853 2249539
MeanSpanOfScaffolds 162147 26701 98669 98298 77570
MinScaffoldSpan 250 64 69 66 65
MaxScaffoldSpan 373369 414184 574729 568754 561885
IntraScaffoldGaps 419 416 405 410 412
2KbScaffolds 8 8 9 10 9
2KbScaffoldSpan 2263613 2255485 2260538 2252297 2237251
MeanSequenceGapLength 167 155 171 178 216
Top5Scaffolds=contigs,size,span,avgContig,avgGap,EUID
0 92,358943,373369,3902,159,3993 86,399071,414184,4640,178,3523 139,548136,574729,3943,193,3812 140,539203,568754,3851,213,3476 135,530386,561885,3929,235,3961
1 54,343859,352964,6368,172,3994 89,355721,370143,3997,164,3575 49,340816,353997,6955,275,3817 48,335577,347723,6991,258,3474 58,316853,328568,5463,206,3948
2 91,340052,355265,3737,169,3990 46,344903,352734,7498,174,3525 71,308097,318810,4339,153,3808 73,308497,320026,4226,160,3471 69,304762,317842,4417,192,3947
3 34,293230,300078,8624,208,3992 95,337740,355081,3555,184,3521 39,292743,297179,7506,117,3811 40,282669,286675,7067,103,3473 43,276732,282410,6436,135,3951
4 43,280238,285955,6517,136,3989 41,283998,286888,6927,72,3524 41,282750,285903,6896,79,3810 29,251670,254649,8678,106,3472 33,218676,229462,6627,337,3950
total 314,1616322,1667631,5148,166 357,1721433,1779030,4822,164 339,1772542,1830618,5229,174 330,1717616,1777827,5205,185 338,1647409,1720167,4874,218
Contigs
TotalContigsInScaffolds 433 501 428 433 441
TotalBasesInScaffolds 2199902 2204929 2200060 2187851 2160642
TotalVarRecords 827 744 807 742 802
MeanContigLength 5081 4401 5140 5053 4899
MinContigLength 89 64 68 66 65
MaxContigLength 48016 48016 48016 48016 36418
N25ContigBases 15775 15519 15092 16008 13723
N50ContigBases 9778 8959 9110 9110 8134
N75ContigBases 4838 4826 4896 4680 4623
ContigAt1000000 11440 10597 10534 10579 8959
ContigAt2000000 2521 2492 2569 2517 2178
BigContigs_greater_10000
TotalBigContigs 67 63 63 62 56
BigContigLength 1084382 1023969 1027589 1032611 893722
MeanBigContigLength 16185 16253 16311 16655 15959
MinBigContig 10406 10368 10252 10081 10131
MaxBigContig 48016 48016 48016 48016 36418
BigContigsPercentBases 49.29 46.44 46.71 47.20 41.36
SmallContigs
TotalSmallContigs 366 438 365 371 385
SmallContigLength 1115520 1180960 1172471 1155240 1266920
MeanSmallContigLength 3048 2696 3212 3114 3291
MinSmallContig 89 64 68 66 65
MaxSmallContig 9873 9885 9782 9908 9894
SmallContigsPercentBases 50.71 53.56 53.29 52.80 58.64
DegenContigs
TotalDegenContigs 367 409 365 434 407
DegenContigLength 129567 133287 134095 145266 163785
DegenVarRecords 64 65 61 60 51
MeanDegenContigLength 353 326 367 335 402
MinDegenContig 65 65 65 65 65
MaxDegenContig 4088 4088 4088 996 997
DegenPercentBases 5.89 6.04 6.10 6.64 7.58
Top5Contigs=reads,bases,EUID
0 8154,48016,3891 8020,48016,3432 8154,48016,3717 8029,48016,3321 6601,36418,3768
1 5332,33223,3894 5331,33223,3435 4980,33869,3649 4953,33869,3352 5144,35088,3693
2 4120,28550,3792 3944,28550,3375 5323,33223,3720 5323,33223,3319 5333,33223,3765
3 4220,27594,3961 4099,27593,3363 4221,27593,3663 4083,27211,3363 4213,27594,3821
4 3687,25746,3969 3508,25746,3357 3525,26658,3614 3441,26581,3284 3547,26658,3719
total 25513,163129 24902,163128 26203,169359 25829,168900 24838,158981
UniqueUnitigs
TotalUUnitigs 2365 1785 2159 1749 2154
MinUUnitigLength 64 64 64 64 64
MaxUUnitigLength 16676 16676 16676 16676 16676
MeanUUnitigLength 914 1180 982 1192 961
SDUUnitigLength 1531 1679 1584 1693 1594
Surrogates
TotalSurrogates 448 410 480 410 526
SurrogateInstances 773 665 793 675 828
SurrogateLength 178159 167290 196933 180448 225278
SurrogateInstanceLength 302899 278370 326667 297507 337734
UnPlacedSurrReadLen 1070908 2189165 1096411 2242448 1374995
PlacedSurrReadLen 2048107 774923 2271876 1021913 2368143
MinSurrogateLength 64 64 64 64 64
MaxSurrogateLength 6031 6031 6031 6031 6031
MeanSurrogateLength 398 408 410 440 428
SDSurrogateLength 398 405 386 441 405
Mates
ReadsWithNoMate 5561(1.78%) 175243(55.94%) 5561(1.78%) 143601(45.84%) 5561(1.78%)
ReadsWithGoodMate 240616(76.80%) 108340(34.58%) 238954(76.27%) 131304(41.91%) 230322(73.52%)
ReadsWithBadShortMate 0(0.00%) 0(0.00%) 0(0.00%) 0(0.00%) 0(0.00%)
ReadsWithBadLongMate 846(0.27%) 338(0.11%) 890(0.28%) 422(0.13%) 806(0.26%)
ReadsWithSameOrientMate 1688(0.54%) 664(0.21%) 1810(0.58%) 922(0.29%) 1724(0.55%)
ReadsWithOuttieMate 964(0.31%) 408(0.13%) 962(0.31%) 520(0.17%) 980(0.31%)
ReadsWithBothChaffMate 990(0.32%) 476(0.15%) 990(0.32%) 514(0.16%) 990(0.32%)
ReadsWithChaffMate 2410(0.77%) 930(0.30%) 2762(0.88%) 1392(0.44%) 2596(0.83%)
ReadsWithBothDegenMate 2778(0.89%) 1698(0.54%) 3194(1.02%) 2370(0.76%) 4544(1.45%)
ReadsWithDegenMate 24844(7.93%) 11766(3.76%) 25218(8.05%) 15666(5.00%) 31260(9.98%)
ReadsWithBothSurrMate 2200(0.70%) 948(0.30%) 2216(0.71%) 1260(0.40%) 2198(0.70%)
ReadsWithSurrogateMate 7880(2.52%) 3438(1.10%) 7544(2.41%) 4226(1.35%) 9300(2.97%)
ReadsWithDiffScafMate 22506(7.18%) 9034(2.88%) 23182(7.40%) 11086(3.54%) 23002(7.34%)
ReadsWithUnassignedMate 0(0.00%) 0(0.00%) 0(0.00%) 0(0.00%) 0(0.00%)
TotalScaffoldLinks 488 39 306 204 232
MeanScaffoldLinkWeight 3.71 3.13 5.27 5.55 8.97
Reads
TotalReadsInput NA NA NA NA NA
TotalUsableReads 313283 313283 313283 313283 313283
AvgClearRange 113 113 113 113 113
ContigReads 287950(91.91%) 277021(88.43%) 287059(91.63%) 276064(88.12%) 281661(89.91%)
BigContigReads 151145(48.25%) 139046(44.38%) 142960(45.63%) 139098(44.40%) 123494(39.42%)
SmallContigReads 136805(43.67%) 137975(44.04%) 144099(46.00%) 136966(43.72%) 158167(50.49%)
DegenContigReads 13771(4.40%) 14067(4.49%) 14253(4.55%) 14547(4.64%) 17283(5.52%)
SurrogateReads 27796(8.87%) 26413(8.43%) 30020(9.58%) 29010(9.26%) 33426(10.67%)
PlacedSurrogateReads 18476(5.90%) 7036(2.25%) 20467(6.53%) 9167(2.93%) 21422(6.84%)
SingletonReads 2242(0.72%) 2818(0.90%) 2418(0.77%) 2829(0.90%) 2335(0.75%)
ChaffReads 2242(0.72%) 2818(0.90%) 2418(0.77%) 2829(0.90%) 2335(0.75%)
Coverage
ContigsOnly 14.75 14.16 14.70 14.23 14.69
Contigs_Surrogates 15.23 15.16 15.20 15.25 15.33
Contigs_Degens_Surrogates 15.04 14.95 15.00 14.99 15.07
AllReads 16.03 16.00 16.03 16.12 16.32
TotalBaseCounts
BasesCount NA NA NA NA NA
ClearRangeLengthFRG NA NA NA NA NA
ClearRangeLengthASM 35272071 35272162 35272090 35272150 35271815
SurrogateBaseLength 3119015 2964088 3368287 3264361 3743138
ContigBaseLength 32440401 31229095 32341921 31130543 31740175
DegenBaseLength 1519006 1549705 1572712 1597413 1904215
SingletonBaseLength 241756 304197 261046 301746 252430
Contig_SurrBaseLength 33511309 33418260 33438332 33372991 33115170
gcContent
Content 47.86 47.87 47.84 47.85 47.81
Unitig Consensus
NumColumnsInUnitigs 8130403 8130403 8130403 8130403 8130403
NumGapsInUnitigs 48256 48256 48256 48256 48256
NumRunsOfGapsInUnitigReads 948215 948215 948215 948215 948215
Contig Consensus
NumColumnsInUnitigs 2346287 2354295 2350907 2348994 2341035
NumGapsInUnitigs 16817 16078 16751 15877 16607
NumRunsOfGapsInUnitigReads 287426 272696 286214 266132 283919
NumColumnsInContigs 2346163 2354185 2350790 2348884 2340928
NumGapsInContigs 16693 15968 16636 15767 16500
NumRunsOfGapsInContigReads 281993 267903 281251 261532 278689
NumAAMismatches 995 870 945 857 912
NumVARStringsWithFlankingGaps 91 74 87 76 90
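Individual rows of a comparison like this can be pulled out with standard tools. This is a minimal sketch, assuming each `test.qc` file stores its values as `Key=Value` lines. That is an assumption about the file format and should be checked against the actual files. The helper `qc_row` is hypothetical.

```shell
#!/bin/sh
# Print one table row: the key, then its value from each QC file in turn.
# Prints NA for a file that has no line for the key.
# Assumes Key=Value lines in each file (an assumption, see the text above).
qc_row() {
  key=$1
  shift
  row=$key
  for f in "$@"; do
    # First line whose key matches exactly; print the text after the "=".
    val=$(awk -F= -v k="$key" '$1 == k { print $2; exit }' "$f")
    row="$row ${val:-NA}"
  done
  echo "$row"
}
```

For example, `qc_row TotalScaffolds test-AB+CD/test.qc test-AB/test.qc test-CD+AB/test.qc test-CD/test.qc test-orig/test.qc` would print the TotalScaffolds row in the column order used above. As a cross-check on the table, the MeanContigsPerScaffold values are consistent with TotalContigsInScaffolds divided by TotalScaffolds: for test-AB+CD, 433 / 14 rounds to 30.93.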