Yersinia pestis KIM D27, using Illumina paired-end reads, with CA8

From wgs-assembler


We will assemble an Illumina paired-end library with a 600 bp insert size from Yersinia pestis KIM D27 (study SRP001358). The library is captured in experiment SRX048908, run SRR133640.

Fetch

mkdir READS-sra
cd READS-sra

curl -O ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR133/SRR133640/SRR133640.sra

Convert

Dump the reads as FASTQ, then generate an FRG format wrapper to load them into the assembler.

sra-extract-fastq.sh SRR133640.sra

fastqToCA \
  -libraryname SRR133640 \
  -insertsize 600 60 \
  -technology illumina \
  -type sanger \
  -mates SRR133640.1.fastq,SRR133640.2.fastq \
> SRR133640.frg
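
Since fastqToCA is given the two files as mates, it can be worth confirming that they hold the same number of reads before building the wrapper. The following is a minimal sketch, not part of the original pipeline: the function names are hypothetical, and it assumes plain four-line FASTQ records (no wrapped sequence or quality lines).

```shell
# Count records in a FASTQ file, assuming exactly four lines per read.
count_fastq_reads() {
  awk 'END {
         if (NR % 4 != 0) {
           print "line count is not a multiple of 4" > "/dev/stderr"
           exit 1
         }
         printf "%d\n", NR / 4
       }' "$1"
}

# Succeed only if both mate files hold the same number of reads.
check_mate_counts() {
  n1=$(count_fastq_reads "$1") || return 1
  n2=$(count_fastq_reads "$2") || return 1
  if [ "$n1" -eq "$n2" ]; then
    echo "OK: $n1 pairs"
  else
    echo "MISMATCH: $1 has $n1 reads, $2 has $n2 reads" >&2
    return 1
  fi
}
```

For this library it would be called as: check_mate_counts SRR133640.1.fastq SRR133640.2.fastq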

Correct

Instead of using Overlap Based Trimming to trim the reads, we'll try to correct errors, trimming off ends that cannot be corrected. As with most other correction strategies, merTrim uses k-mers as evidence. We first concatenate the short reads into larger sequences, then count k-mers, and finally dump the histogram of counts.

cat SRR133640.1.fastq SRR133640.2.fastq | fastq-to-fasta-merged.pl > SRR133640.merged.fasta

meryl -v -B -C -m 19 -s SRR133640.merged.fasta -o SRR133640

meryl -Dh -s SRR133640 > SRR133640.histogram

The histogram has the expected shape: a coverage peak at about 33x, and a very large number of low-count error k-mers.

K-mer histogram
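
The coverage peak can also be read off the histogram file without plotting it. The sketch below is an illustration under assumptions: it takes the first column to be the k-mer multiplicity and the second to be the number of distinct k-mers with that multiplicity, sorted by ascending multiplicity, and the function name is hypothetical. It skips the initial error peak by waiting for the first increase in the second column, then reports the multiplicity with the largest count after that point. A noisy histogram may need smoothing first.

```shell
# Report the multiplicity at the coverage peak of a k-mer histogram.
# Assumed layout: column 1 = multiplicity, column 2 = number of distinct
# k-mers with that multiplicity, rows sorted by ascending multiplicity.
histogram_peak() {
  awk '
    NF >= 2 {
      # Still descending the error peak: wait for the first increase.
      if (!past_valley) {
        if (seen && $2 > prev) past_valley = 1
        prev = $2
        seen = 1
      }
      # After the valley, track the row with the largest count.
      if (past_valley && $2 > best) { best = $2; peak = $1 }
    }
    END { if (past_valley) print peak; else exit 1 }
  ' "$1"
}
```

For the file produced above it would be called as: histogram_peak SRR133640.histogram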

Correction is done once per file, then we create a new FRG format wrapper for the corrected reads.

merTrim -v -t 8 -m 19 -mc SRR133640 -mCillumina \
  -F SRR133640.1.fastq -o SRR133640.corrected.1.fastq

merTrim -v -t 8 -m 19 -mc SRR133640 -mCillumina \
  -F SRR133640.2.fastq -o SRR133640.corrected.2.fastq

fastqToCA \
  -libraryname SRR133640 \
  -insertsize 600 60 \
  -technology illumina \
  -type sanger \
  -mates SRR133640.corrected.1.fastq,SRR133640.corrected.2.fastq \
> SRR133640.corrected.frg
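
To see how much the correction step trimmed, the read count and mean read length can be compared before and after. This is a small sketch rather than part of the original procedure: the function name is hypothetical, and it assumes four-line FASTQ records in both the input and the merTrim output.

```shell
# Print "reads mean_length" for a FASTQ file, assuming four lines per read.
fastq_summary() {
  awk 'NR % 4 == 2 { n++; bases += length($0) }   # sequence lines only
       END {
         if (n) printf "%d %.1f\n", n, bases / n
         else   print "0 0.0"
       }' "$1"
}
```

Running fastq_summary on SRR133640.1.fastq and then on SRR133640.corrected.1.fastq gives the two figures to compare.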

Assemble

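We assemble the raw and the corrected reads separately. Overlap Based Trimming is enabled for the raw reads (doOBT=1) and disabled for the corrected reads (doOBT=0), since those were already trimmed by merTrim. Open question: should fragment correction be turned off too?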

runCA -p ypestis -d ypestis-raw       -s ypestis.spec doOBT=1 unitigger=bogart READS-sra/SRR133640.frg
runCA -p ypestis -d ypestis-corrected -s ypestis.spec doOBT=0 unitigger=bogart READS-sra/SRR133640.corrected.frg

Results

Neither assembly is great, especially compared to the assembly with 454 reads.

QC statistics comparison

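With corrected reads, the assembly is much cleaner: fewer small scaffolds (145 vs 223 scaffolds in total, while 2KbScaffolds is 144 in both), fewer small contigs (143 vs 252), fewer degenerate contigs (702 vs 966), and fewer singletons (8676 vs 14240). More reads have a good mate (89.18% vs 85.58%), and more reads are assembled into contigs (2076769 vs 1966763). None of this helped the assembly structure, though: scaffold N50 is essentially unchanged (49329 vs 49368) and contig N50 improves only slightly (26666 vs 24644).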
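
For readers unfamiliar with the N50 rows used to judge assembly structure, here is a minimal sketch of the common definition: the length L such that sequences of length L or longer hold at least half of the total bases. The function name is hypothetical, and the QC report may compute its N25/N50/N75 values slightly differently.

```shell
# Print the N50 of a list of lengths read from stdin, one length per line.
n50() {
  sort -rn | awk '
    { len[NR] = $1; total += $1 }
    END {
      # Walk from longest to shortest until half the bases are covered.
      for (i = 1; i <= NR; i++) {
        running += len[i]
        if (running * 2 >= total) { print len[i]; exit }
      }
    }'
}
```

For example, the lengths 10, 8, 6, 4, 2 total 30 bases; the two longest sequences cover 18 of them, so the N50 is 8.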

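mergeqc.pl -wiki ypestis-*/*qc > qc.wiki

In the table below, each row gives the value for the corrected-read assembly first and the raw-read assembly second.

Files
ypestis-corrected/ypestis.qc ypestis-raw/ypestis.qc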
Scaffolds
TotalScaffolds 145 223
TotalContigsInScaffolds 289 394
MeanContigsPerScaffold 1.99 1.77
MinContigsPerScaffold 1 1
MaxContigsPerScaffold 9 11
TotalBasesInScaffolds 4488041 4485639
MeanBasesInScaffolds 30952 20115
MinBasesInScaffolds 1655 65
MaxBasesInScaffolds 120993 120581
N25ScaffoldBases 71468 66694
N50ScaffoldBases 49329 49368
N75ScaffoldBases 31714 34128
ScaffoldAt1000000 73775 68407
ScaffoldAt2000000 52093 55730
ScaffoldAt3000000 37107 38466
ScaffoldAt4000000 17946 16973
TotalSpanOfScaffolds 4485627 4488813
MeanSpanOfScaffolds 30935 20129
MinScaffoldSpan 1655 65
MaxScaffoldSpan 120930 120546
IntraScaffoldGaps 144 171
2KbScaffolds 144 144
2KbScaffoldSpan 4483972 4477254
MeanSequenceGapLength -17 19
Top5Scaffolds=contigs,size,span,avgContig,avgGap,EUID
0 6,120993,120930,20166,-13,120002257128 6,120581,120546,20097,-7,120002326108
1 5,115726,115652,23145,-18,120002257165 3,111815,111775,37272,-20,120002326179
2 3,111924,111884,37308,-20,120002257105 6,103331,103231,17222,-20,120002326102
3 5,103365,103285,20673,-20,120002257114 7,102182,102151,14597,-5,120002326138
4 8,102377,102281,12797,-14,120002257120 4,88507,88453,22127,-18,120002326190
total 27,554385,554032,20533,-16 26,526416,526156,20247,-12
Contigs
TotalContigsInScaffolds 289 394
TotalBasesInScaffolds 4488041 4485639
TotalVarRecords 1833 7671
MeanContigLength 15530 11385
MinContigLength 67 65
MaxContigLength 68862 67389
N25ContigBases 42689 40921
N50ContigBases 26666 24644
N75ContigBases 13364 12731
ContigAt1000000 45079 44514
ContigAt2000000 29667 27012
ContigAt3000000 18571 16527
ContigAt4000000 8315 7209
BigContigs_greater_10000
TotalBigContigs 146 142
BigContigLength 3788421 3643671
MeanBigContigLength 25948 25660
MinBigContig 10057 10121
MaxBigContig 68862 67389
BigContigsPercentBases 84.41 81.23
SmallContigs
TotalSmallContigs 143 252
SmallContigLength 699620 841968
MeanSmallContigLength 4892 3341
MinSmallContig 67 65
MaxSmallContig 9983 9919
SmallContigsPercentBases 15.59 18.77
DegenContigs
TotalDegenContigs 702 966
DegenContigLength 173734 202870
DegenVarRecords 320 691
MeanDegenContigLength 247 210
MinDegenContig 65 65
MaxDegenContig 14409 17852
DegenPercentBases 3.87 4.52
Top5Contigs=reads,bases,EUID
0 31984,68862,120002256110 29724,67389,120002325974
1 31311,66961,120002256955 29303,66586,120002325893
2 29872,63712,120002257075 30018,63845,120002325765
3 28658,61840,120002256919 28287,63704,120002325810
4 28626,61745,120002256912 27081,62122,120002326021
total 150451,323120 144413,323646
UniqueUnitigs
TotalUUnitigs 3347 11021
MinUUnitigLength 64 64
MaxUUnitigLength 68865 35382
MeanUUnitigLength 1391 482
SDUUnitigLength 5206 2000
Surrogates
TotalSurrogates 668 911
SurrogateInstances 3163 2868
SurrogateLength 102344 122777
SurrogateInstanceLength 469406 491421
UnPlacedSurrReadLen 6246220 6294356
PlacedSurrReadLen 1036340 846995
MinSurrogateLength 77 67
MaxSurrogateLength 5450 3643
MeanSurrogateLength 153 135
SDSurrogateLength 269 198
Mates
ReadsWithNoMate 53710(2.39%) 131471(6.16%)
ReadsWithGoodMate 2007884(89.18%) 1826976(85.58%)
ReadsWithBadShortMate 0(0.00%) 0(0.00%)
ReadsWithBadLongMate 452(0.02%) 190(0.01%)
ReadsWithSameOrientMate 194(0.01%) 118(0.01%)
ReadsWithOuttieMate 254(0.01%) 184(0.01%)
ReadsWithBothChaffMate 1562(0.07%) 1378(0.06%)
ReadsWithChaffMate 13964(0.62%) 21136(0.99%)
ReadsWithBothDegenMate 104144(4.63%) 79294(3.71%)
ReadsWithDegenMate 15094(0.67%) 11052(0.52%)
ReadsWithBothSurrMate 31746(1.41%) 42066(1.97%)
ReadsWithSurrogateMate 8284(0.37%) 8242(0.39%)
ReadsWithDiffScafMate 14104(0.63%) 12744(0.60%)
ReadsWithUnassignedMate 0(0.00%) 0(0.00%)
TotalScaffoldLinks 7 1
MeanScaffoldLinkWeight 18.00 34.00
Reads
TotalReadsInput NA NA
TotalUsableReads 2251392 2134851
AvgClearRange 98 95
ContigReads 2076769(92.24%) 1966763(92.13%)
BigContigReads 1778102(78.98%) 1624023(76.07%)
SmallContigReads 298667(13.27%) 342740(16.05%)
DegenContigReads 102262(4.54%) 88247(4.13%)
SurrogateReads 74317(3.30%) 74638(3.50%)
PlacedSurrogateReads 10632(0.47%) 9037(0.42%)
SingletonReads 8676(0.39%) 14240(0.67%)
ChaffReads 8662(0.38%) 14234(0.67%)
Coverage
ContigsOnly 45.46 41.84
Contigs_Surrogates 46.85 43.24
Contigs_Degens_Surrogates 47.27 43.18
AllReads 49.29 45.43
TotalBaseCounts
BasesCount NA NA
ClearRangeLengthFRG NA NA
ClearRangeLengthASM 221202093 203775929
SurrogateBaseLength 7282560 7141351
ContigBaseLength 204039330 187673720
DegenBaseLength 10056465 8476955
SingletonBaseLength 860078 1330898
Contig_SurrBaseLength 210285550 193968076
gcContent
Content 47.48 47.35
Unitig Consensus
NumColumnsInUnitigs 17376511 20907336
NumGapsInUnitigs 1629 2833
NumRunsOfGapsInUnitigReads 205193 669552
Contig Consensus
NumColumnsInUnitigs 4665774 4705207
NumGapsInUnitigs 4002 12184
NumRunsOfGapsInUnitigReads 243047 231762
NumColumnsInContigs 4665624 4705125
NumGapsInContigs 3850 12101
NumRunsOfGapsInContigReads 234051 221983
NumAAMismatches 2691 560632
NumVARStringsWithFlankingGaps 1156 800
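
With this many rows, a relative difference per metric makes the comparison easier to scan. The sketch below assumes the rows have been saved to a plain text file in the three-column form shown above (metric, corrected value, raw value); the function name and that file are hypothetical. It drops trailing "(xx.xx%)" suffixes, skips rows whose values are not plain numbers (section headings, NA entries, the Top5 lists), and prints the change of the corrected assembly relative to the raw one.

```shell
# Print metric, corrected, raw, and relative change (corrected vs raw).
# Input rows are assumed to look like: "TotalScaffolds 145 223".
qc_delta() {
  awk '
    NF == 3 {
      c = $2; r = $3
      sub(/\(.*\)$/, "", c)    # drop a trailing "(12.34%)" if present
      sub(/\(.*\)$/, "", r)
      if (c ~ /^-?[0-9]+(\.[0-9]+)?$/ && r ~ /^-?[0-9]+(\.[0-9]+)?$/ &&
          (r + 0) != 0)
        printf "%s\t%s\t%s\t%+.1f%%\n", $1, c, r, 100 * (c - r) / r
    }' "$1"
}
```

On the TotalScaffolds row, for example, this reports a change of -35.0% (145 against 223).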