Porphyromonas gingivalis W83, using 454 3 Kbp mated reads, with CA8.1

We will assemble our favorite test case, Porphyromonas gingivalis, from two lanes of 454 3 Kbp mate pair reads. We'll discard the unmated reads.

Rather than write up the conversion, here are the converted reads, ready to download with curl:

curl -L -o porphyromonas_gingivalis_w83.flx.3200bp.0900bp.E8YURXS01.mate.frg.xz http://sourceforge.net/projects/wgs-assembler/files/wgs-assembler/wgs-8.0/datasets/porphyromonas_gingivalis_w83.flx.3200bp.0900bp.E8YURXS01.mate.frg.xz
curl -L -o porphyromonas_gingivalis_w83.flx.3200bp.0900bp.E8YURXS02.mate.frg.xz http://sourceforge.net/projects/wgs-assembler/files/wgs-assembler/wgs-8.0/datasets/porphyromonas_gingivalis_w83.flx.3200bp.0900bp.E8YURXS02.mate.frg.xz
curl -L -o porphyromonas_gingivalis_w83.flx.3200bp.0900bp.FJRUAFO01.mate.frg.xz http://sourceforge.net/projects/wgs-assembler/files/wgs-assembler/wgs-8.0/datasets/porphyromonas_gingivalis_w83.flx.3200bp.0900bp.FJRUAFO01.mate.frg.xz
curl -L -o porphyromonas_gingivalis_w83.flx.3200bp.0900bp.FJRUAFO02.mate.frg.xz http://sourceforge.net/projects/wgs-assembler/files/wgs-assembler/wgs-8.0/datasets/porphyromonas_gingivalis_w83.flx.3200bp.0900bp.FJRUAFO02.mate.frg.xz
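
The four downloads differ only in the run/region tag, so they can be generated in a loop. This is a minimal sketch, assuming the same base URL and file naming as the commands above; the function names frg_name, frg_urls and fetch_frgs are made up for illustration and are not part of wgs-assembler.

```shell
#!/bin/sh
# Base URL and file-name pattern copied from the curl commands above.
BASE=http://sourceforge.net/projects/wgs-assembler/files/wgs-assembler/wgs-8.0/datasets
TAGS="E8YURXS01 E8YURXS02 FJRUAFO01 FJRUAFO02"

# Print the file name for one run/region tag.
frg_name() {
  echo "porphyromonas_gingivalis_w83.flx.3200bp.0900bp.$1.mate.frg.xz"
}

# Print one "name url" pair per line, without touching the network.
frg_urls() {
  for tag in $TAGS; do
    name=$(frg_name "$tag")
    echo "$name $BASE/$name"
  done
}

# Download every file that is not already present (hypothetical helper).
fetch_frgs() {
  frg_urls | while read -r name url; do
    [ -s "$name" ] || curl -L -o "$name" "$url"
  done
}
```

Calling frg_urls lists what would be fetched; calling fetch_frgs performs the downloads. Because fetch_frgs skips any non-empty file that already exists, a partial download should be deleted before retrying.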

The following script runs the assemblies. It first runs the 'test-orig' assembly through unitig construction, including consensus, and then stops. That directory is then copied to four new assembly directories, in which we remove some of the mate pair links (the reads themselves are retained).

test-orig: contains all the mate links.
test-AB: contains only mate links from the A and B reads.
test-CD: contains only mate links from the C and D reads.
test-AB+CD: contains only mate links from the A and B reads, but the C and D mate links will be added in the middle of scaffolding.
test-CD+AB: contains only mate links from the C and D reads, but the A and B mate links will be added in the middle of scaffolding.

#!/bin/sh

#  Build initial unitigs, stop after consensus

runCA -p test -d test-orig stopAfter=utgcns \
  useGrid=0 scriptOnGrid=0 \
  unitigger=bogart \
  cnsConcurrency=12 \
  ovlConcurrency=4 \
  porphyromonas_gingivalis_w83.flx.3200bp.0900bp.*frg.xz

#  Copy to test directories

cp -prv test-orig test-AB
cp -prv test-orig test-CD
cp -prv test-orig test-AB+CD
cp -prv test-orig test-CD+AB

#  Delete mates, save orig mates

cp -p test-orig/test.gkpStore/fnm test-orig/test.gkpStore/fnm.orig

echo lib iid 1 allfragsunmated t > delete1
echo lib iid 2 allfragsunmated t > delete2
echo lib iid 3 allfragsunmated t > delete3
echo lib iid 4 allfragsunmated t > delete4

gatekeeper --edit delete1 test-CD/test.gkpStore ; gatekeeper --edit delete2 test-CD/test.gkpStore
gatekeeper --edit delete3 test-AB/test.gkpStore ; gatekeeper --edit delete4 test-AB/test.gkpStore

gatekeeper --edit delete1 test-CD+AB/test.gkpStore ; gatekeeper --edit delete2 test-CD+AB/test.gkpStore
gatekeeper --edit delete3 test-AB+CD/test.gkpStore ; gatekeeper --edit delete4 test-AB+CD/test.gkpStore

#  Finish assemblies

runCA -p test -d test-orig useGrid=1 scriptOnGrid=1

runCA -p test -d test-AB useGrid=1 scriptOnGrid=1
runCA -p test -d test-CD useGrid=1 scriptOnGrid=1

#  Run first CGW, restore mates, finish assembly

runCA -p test -d test-AB+CD stopBefore=ECR useGrid=0 scriptOnGrid=0
runCA -p test -d test-CD+AB stopBefore=ECR useGrid=0 scriptOnGrid=0

cp -fp test-orig/test.gkpStore/fnm.orig test-CD+AB/test.gkpStore/fnm
cp -fp test-orig/test.gkpStore/fnm.orig test-AB+CD/test.gkpStore/fnm

runCA -p test -d test-AB+CD cgwReloadMates=1 useGrid=1 scriptOnGrid=1
runCA -p test -d test-CD+AB cgwReloadMates=1 useGrid=1 scriptOnGrid=1
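
The mate-removal step in the script follows a fixed pattern: each derived directory loses the mates of two libraries. The sketch below makes that mapping explicit and prints the gatekeeper commands instead of running them. It assumes, as the script does, that library IIDs 1 and 2 hold the A and B reads and IIDs 3 and 4 hold the C and D reads; libs_to_unmate and print_edits are illustrative names, not wgs-assembler tools.

```shell
#!/bin/sh
# Which library IIDs lose their mates in each directory, per the script above:
# directories that start from C,D mates drop libraries 1 and 2 (A, B), and
# directories that start from A,B mates drop libraries 3 and 4 (C, D).
libs_to_unmate() {
  case "$1" in
    test-CD|test-CD+AB) echo "1 2" ;;
    test-AB|test-AB+CD) echo "3 4" ;;
    test-orig)          echo ""    ;;
    *) echo "unknown assembly: $1" >&2; return 1 ;;
  esac
}

# Print (not run) the gatekeeper edits for one directory.
print_edits() {
  libs=$(libs_to_unmate "$1") || return 1
  for iid in $libs; do
    echo "gatekeeper --edit delete$iid $1/test.gkpStore"
  done
}

for d in test-orig test-AB test-CD test-AB+CD test-CD+AB; do
  print_edits "$d"
done
```

The printed lines can be compared against the gatekeeper calls in the script as a sanity check before touching any store; test-orig prints nothing because it keeps every mate.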

After these finish, we can make some dot plots to visualize the results.

dotplot.sh CD.CTG    AE015924.fasta test-CD/9-terminator/test.ctg.fasta
dotplot.sh AB.CTG    AE015924.fasta test-AB/9-terminator/test.ctg.fasta
dotplot.sh CD+AB.CTG AE015924.fasta test-CD+AB/9-terminator/test.ctg.fasta
dotplot.sh AB+CD.CTG AE015924.fasta test-AB+CD/9-terminator/test.ctg.fasta
dotplot.sh ABCD.CTG  AE015924.fasta test-orig/9-terminator/test.ctg.fasta

dotplot.sh CD.SCF    AE015924.fasta test-CD/9-terminator/test.scf.fasta
dotplot.sh AB.SCF    AE015924.fasta test-AB/9-terminator/test.scf.fasta
dotplot.sh CD+AB.SCF AE015924.fasta test-CD+AB/9-terminator/test.scf.fasta
dotplot.sh AB+CD.SCF AE015924.fasta test-AB+CD/9-terminator/test.scf.fasta
dotplot.sh ABCD.SCF  AE015924.fasta test-orig/9-terminator/test.scf.fasta
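
The ten commands above vary only in the plot label and in whether contigs or scaffolds are plotted, so they can be generated rather than typed. This is a sketch that only prints the command lines; it assumes dotplot.sh takes the same three arguments as above (output prefix, reference, assembly FASTA), and dir_for and dotplot_cmds are hypothetical helper names.

```shell
#!/bin/sh
REF=AE015924.fasta

# Map a plot label to its assembly directory; ABCD is the all-mates run.
dir_for() {
  case "$1" in
    ABCD) echo test-orig ;;
    *)    echo "test-$1" ;;
  esac
}

# Print the dotplot.sh command lines for contigs (ctg) and scaffolds (scf).
dotplot_cmds() {
  for kind in ctg scf; do
    upper=$(echo "$kind" | tr '[:lower:]' '[:upper:]')
    for label in CD AB CD+AB AB+CD ABCD; do
      echo "dotplot.sh $label.$upper $REF $(dir_for "$label")/9-terminator/test.$kind.fasta"
    done
  done
}

dotplot_cmds
```

Piping the output to sh would run the plots, provided dotplot.sh is on the PATH and the assemblies have finished.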


Figures (scaffold dot plots against AE015924.fasta): scaffolds without A,B mate links; scaffolds without C,D mate links; scaffolds with A,B mate links added later; scaffolds with C,D mate links added later; scaffolds with all mate links.

We can also compare the QC reports.

Files (columns, left to right, for every row below)
test-AB+CD/test.qc test-AB/test.qc test-CD+AB/test.qc test-CD/test.qc test-orig/test.qc
Scaffolds
TotalScaffolds 14 85 23 23 29
TotalContigsInScaffolds 433 501 429 433 441
MeanContigsPerScaffold 30.93 5.89 18.65 18.83 15.21
MinContigsPerScaffold 1 1 1 1 1
MaxContigsPerScaffold 92 95 139 140 135
TotalBasesInScaffolds 2199902 2204930 2199624 2187407 2160649
MeanBasesInScaffolds 157136 25940 95636 95105 74505
MinBasesInScaffolds 250 64 69 66 65
MaxBasesInScaffolds 358943 399072 547694 538759 530393
N25ScaffoldBases 343859 355721 340816 335577 316853
N50ScaffoldBases 293230 337740 308103 308497 304762
N75ScaffoldBases 240361 283998 282750 251670 218676
ScaffoldAt1000000 340052 344903 308103 308497 304762
ScaffoldAt2000000 180074 175613 156283 156263 147318
TotalSpanOfScaffolds 2270053 2269570 2268919 2260408 2249545
MeanSpanOfScaffolds 162147 26701 98649 98279 77571
MinScaffoldSpan 250 64 69 66 65
MaxScaffoldSpan 373369 414185 574287 568310 561891
IntraScaffoldGaps 419 416 406 410 412
2KbScaffolds 8 8 9 10 9
2KbScaffoldSpan 2263613 2255486 2260081 2251852 2237257
MeanSequenceGapLength 167 155 171 178 216
Top5Scaffolds=contigs,size,span,avgContig,avgGap
0 92,358943,373369,3902,159 86,399072,414185,4640,178 139,547694,574287,3940,193 140,538759,568310,3848,213 135,530393,561891,3929,235
1 54,343859,352964,6368,172 89,355721,370143,3997,164 49,340816,353997,6955,275 48,335577,347723,6991,258 58,316853,328568,5463,206
2 91,340052,355265,3737,169 46,344903,352734,7498,174 72,308103,318796,4279,151 73,308497,320026,4226,160 69,304762,317842,4417,192
3 34,293230,300078,8624,208 95,337740,355081,3555,184 39,292743,297178,7506,117 40,282669,286675,7067,103 43,276732,282410,6436,135
4 43,280238,285955,6517,136 41,283998,286888,6927,72 41,282750,285903,6896,79 29,251670,254648,8678,106 33,218676,229462,6627,337
total 314,1616322,1667631,5148,166 357,1721434,1779031,4822,164 340,1772106,1830161,5212,173 330,1717172,1777382,5204,185 338,1647416,1720173,4874,218
Contigs
TotalContigsInScaffolds 433 501 429 433 441
TotalBasesInScaffolds 2199902 2204930 2199624 2187407 2160649
TotalVarRecords 826 742 807 742 798
MeanContigLength 5081 4401 5127 5052 4899
MinContigLength 89 64 68 66 65
MaxContigLength 48016 48016 48016 48016 36418
N25ContigBases 15775 15519 15092 16008 13723
N50ContigBases 9778 8959 9054 9110 8134
N75ContigBases 4838 4826 4830 4680 4623
ContigAt1000000 11440 10597 10321 10579 8959
ContigAt2000000 2521 2492 2565 2517 2178
BigContigs_greater_10000
TotalBigContigs 67 63 62 62 56
BigContigLength 1084382 1023969 1016825 1032611 893722
MeanBigContigLength 16185 16253 16400 16655 15959
MinBigContig 10406 10368 10252 10081 10131
MaxBigContig 48016 48016 48016 48016 36418
BigContigsPercentBases 49.29 46.44 46.23 47.21 41.36
SmallContigs
TotalSmallContigs 366 438 367 371 385
SmallContigLength 1115520 1180961 1182799 1154796 1266927
MeanSmallContigLength 3048 2696 3223 3113 3291
MinSmallContig 89 64 68 66 65
MaxSmallContig 9873 9885 9782 9908 9894
SmallContigsPercentBases 50.71 53.56 53.77 52.79 58.64
DegenContigs
TotalDegenContigs 367 408 367 436 408
DegenContigLength 129584 133186 134353 145524 163903
DegenVarRecords 64 65 63 62 51
MeanDegenContigLength 353 326 366 334 402
MinDegenContig 65 65 65 65 65
MaxDegenContig 4088 4088 4088 996 997
DegenPercentBases 5.89 6.04 6.11 6.65 7.59
Top5Contigs=reads,bases
0 8154,48016 8020,48016 8154,48016 8029,48016 6601,36418
1 5332,33223 5331,33223 4980,33869 4953,33869 5144,35088
2 4120,28550 3944,28550 5323,33223 5323,33223 5333,33223
3 4220,27594 4099,27593 4221,27593 4083,27211 4213,27594
4 3687,25746 3508,25746 3525,26658 3441,26581 3547,26658
total 25513,163129 24902,163128 26203,169359 25829,168900 24838,158981
UniqueUnitigs
TotalUUnitigs 2365 1785 2158 1748 2152
MinUUnitigLength 64 64 64 64 64
MaxUUnitigLength 16676 16676 16676 16676 16676
MeanUUnitigLength 914 1180 982 1192 962
SDUUnitigLength 1531 1679 1584 1694 1595
Surrogates
TotalSurrogates 448 411 479 409 527
SurrogateInstances 773 666 792 673 829
SurrogateLength 178087 167336 196666 180181 225296
SurrogateInstanceLength 302902 278493 326102 296830 337824
UnPlacedSurrReadLen 1070908 2189549 1095598 2240670 1374995
PlacedSurrReadLen 2048107 774989 2270389 1021391 2368859
MinSurrogateLength 64 64 64 64 64
MaxSurrogateLength 6031 6031 6031 6031 6031
MeanSurrogateLength 398 407 411 441 428
SDSurrogateLength 397 404 386 441 404
Mates
ReadsWithNoMate 5561(1.78%) 175243(55.94%) 5561(1.78%) 143601(45.84%) 5561(1.78%)
ReadsWithGoodMate 240616(76.80%) 108342(34.58%) 238924(76.26%) 131290(41.91%) 230316(73.52%)
ReadsWithBadShortMate 0(0.00%) 0(0.00%) 0(0.00%) 0(0.00%) 0(0.00%)
ReadsWithBadLongMate 846(0.27%) 338(0.11%) 890(0.28%) 422(0.13%) 806(0.26%)
ReadsWithSameOrientMate 1688(0.54%) 664(0.21%) 1810(0.58%) 922(0.29%) 1724(0.55%)
ReadsWithOuttieMate 964(0.31%) 408(0.13%) 962(0.31%) 520(0.17%) 980(0.31%)
ReadsWithBothChaffMate 990(0.32%) 476(0.15%) 990(0.32%) 514(0.16%) 990(0.32%)
ReadsWithChaffMate 2410(0.77%) 930(0.30%) 2762(0.88%) 1392(0.44%) 2596(0.83%)
ReadsWithBothDegenMate 2778(0.89%) 1698(0.54%) 3194(1.02%) 2374(0.76%) 4544(1.45%)
ReadsWithDegenMate 24848(7.93%) 11764(3.76%) 25256(8.06%) 15680(5.01%) 31270(9.98%)
ReadsWithBothSurrMate 2200(0.70%) 948(0.30%) 2208(0.70%) 1254(0.40%) 2198(0.70%)
ReadsWithSurrogateMate 7880(2.52%) 3440(1.10%) 7548(2.41%) 4228(1.35%) 9300(2.97%)
ReadsWithDiffScafMate 22502(7.18%) 9032(2.88%) 23178(7.40%) 11086(3.54%) 22998(7.34%)
ReadsWithUnassignedMate 0(0.00%) 0(0.00%) 0(0.00%) 0(0.00%) 0(0.00%)
TotalScaffoldLinks 488 39 310 202 232
MeanScaffoldLinkWeight 3.71 3.13 5.22 5.57 8.97
Reads
TotalReadsInput NA NA NA NA NA
TotalUsableReads 313283 313283 313283 313283 313283
AvgClearRange 113 113 113 113 113
ContigReads 287948(91.91%) 277020(88.42%) 287042(91.62%) 276056(88.12%) 281656(89.90%)
BigContigReads 151145(48.25%) 139046(44.38%) 141780(45.26%) 139098(44.40%) 123494(39.42%)
SmallContigReads 136803(43.67%) 137974(44.04%) 145262(46.37%) 136958(43.72%) 158162(50.49%)
DegenContigReads 13773(4.40%) 14064(4.49%) 14278(4.56%) 14572(4.65%) 17288(5.52%)
SurrogateReads 27796(8.87%) 26418(8.43%) 29998(9.58%) 28988(9.25%) 33434(10.67%)
PlacedSurrogateReads 18476(5.90%) 7037(2.25%) 20453(6.53%) 9162(2.92%) 21430(6.84%)
SingletonReads 2242(0.72%) 2818(0.90%) 2418(0.77%) 2829(0.90%) 2335(0.75%)
ChaffReads 2242(0.72%) 2818(0.90%) 2418(0.77%) 2829(0.90%) 2335(0.75%)
Coverage
ContigsOnly 14.75 14.16 14.70 14.23 14.69
Contigs_Surrogates 15.23 15.16 15.20 15.26 15.33
Contigs_Degens_Surrogates 15.04 14.96 15.00 14.99 15.07
AllReads 16.03 16.00 16.04 16.13 16.32
TotalBaseCounts
BasesCount NA NA NA NA NA
ClearRangeLengthFRG NA NA NA NA NA
ClearRangeLengthASM 35272071 35272162 35272049 35272150 35271815
SurrogateBaseLength 3119015 2964538 3365987 3262061 3743854
ContigBaseLength 32440217 31228977 32340127 31129755 31739725
DegenBaseLength 1519190 1549439 1575278 1599979 1904665
SingletonBaseLength 241756 304197 261046 301746 252430
Contig_SurrBaseLength 33511125 33418526 33435725 33370425 33114720
gcContent
Content 47.86 47.87 47.84 47.85 47.81
Unitig Consensus
NumColumnsInUnitigs 8130079 8130079 8130079 8130079 8130079
NumGapsInUnitigs 48037 48037 48037 48037 48037
NumRunsOfGapsInUnitigReads 914477 914477 914477 914477 914477
Contig Consensus
NumColumnsInUnitigs 2346305 2354195 2350726 2348810 2341160
NumGapsInUnitigs 16818 16078 16748 15879 16607
NumRunsOfGapsInUnitigReads 287426 272686 286197 266151 283892
NumColumnsInContigs 2346181 2354085 2350609 2348699 2341053
NumGapsInContigs 16694 15968 16633 15768 16500
NumRunsOfGapsInContigReads 281979 267893 281234 261530 278662
NumAAMismatches 993 868 945 859 906
NumVARStringsWithFlankingGaps 91 74 88 77 90
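
Every data row in the table lists its five values in the column order given on the Files row, which makes single metrics easy to pull out with awk. The sketch below assumes the table has been saved as whitespace-separated text; qc_row and sample are illustrative names, and the three sample rows are copied from the table.

```shell
#!/bin/sh
# Column order taken from the "Files" row of the table above.
COLS="AB+CD AB CD+AB CD orig"

# Print "label value" lines for one metric from a QC table read on stdin.
# Only the first matching row with exactly five values is used.
qc_row() {
  awk -v metric="$1" -v cols="$COLS" '
    BEGIN { n = split(cols, name, " ") }
    $1 == metric && NF == n + 1 {
      for (i = 1; i <= n; i++) print name[i], $(i + 1)
      exit
    }'
}

# Three rows copied from the table, used here as sample input.
sample() {
  cat <<'EOF'
TotalScaffolds 14 85 23 23 29
TotalContigsInScaffolds 433 501 429 433 441
N50ScaffoldBases 293230 337740 308103 308497 304762
EOF
}

# Label the scaffold counts, then name the assembly with the largest scaffold N50.
sample | qc_row TotalScaffolds
sample | qc_row N50ScaffoldBases | sort -k2,2nr | head -n 1
```

Some metric names, such as TotalContigsInScaffolds, appear in more than one section of the table; qc_row returns the first occurrence only.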