Yersinia pestis KIM D27, using Illumina paired-end reads, with CA8.2


We will assemble a 600 bp Illumina paired-end library from Yersinia pestis KIM D27 (SRA study SRP001358, experiment SRX048908).

Fetch

mkdir READS-sra
cd READS-sra

curl -O ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR133/SRR133640/SRR133640.sra

Convert

Using fastq-dump from the SRA Toolkit, convert the NCBI .sra files into .fastq files, then generate a FRG format wrapper to load them into the assembler.

% fastq-dump \
    --split-files --split-spot --skip-technical \
    --minReadLen 64 \
    --defline-seq  @\$ac.\$si \
    --defline-qual "+" \
    ./SRR133640.sra

% fastqToCA \
    -libraryname SRR133640 \
    -insertsize 600 60 \
    -technology illumina \
    -type sanger \
    -mates SRR133640_1.fastq,SRR133640_2.fastq \
  > SRR133640.frg
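
Before wrapping the two files as mates, it can be worth confirming that they hold the same number of records. The sketch below is not part of the original workflow; it assumes 4-line FASTQ records (as written by fastq-dump here) and that mates are paired by position in the two files. The function names fastq_count and check_pairs are made up for this example.

```shell
# Count records in a FASTQ stream on stdin.
# Assumption: every record is exactly 4 lines (no wrapped sequences).
fastq_count() {
    awk 'END { print int(NR / 4) }'
}

# Succeed only if both FASTQ files hold the same number of records.
check_pairs() {
    n1=$(fastq_count < "$1")
    n2=$(fastq_count < "$2")
    [ "$n1" -eq "$n2" ]
}

# Example use (hypothetical):
#   check_pairs SRR133640_1.fastq SRR133640_2.fastq || echo "mate files differ" >&2
```

This only compares counts; it does not verify that read names match line for line.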

Correct

Instead of using Overlap Based Trimming to trim the reads, we'll try to correct errors with merTrim, trimming off ends that cannot be corrected. As with most other correction strategies, merTrim uses k-mers as evidence. We first concatenate the short reads into larger sequences (using fastq-to-fasta-merged.pl), then count k-mers with meryl, and finally dump the histogram of counts.

% cat SRR133640_1.fastq SRR133640_2.fastq | fastq-to-fasta-merged.pl > SRR133640.merged.fasta

% meryl -v -B -C -m 19 -s SRR133640.merged.fasta -o SRR133640
Have NO LIMITS!: mersPerBatch=233487221 segmentLimit=1
basesPerBatch = 233491396
Computing 1 segments using AS MUCH MEMORY AS NEEDED.
  numMersActual      = 233487221
  mersPerBatch       = 233487221
  basesPerBatch      = 233491396
  numBuckets         = 8388608 (23 bits)
  bucketPointerWidth = 28
  merDataWidth       = 15
 Allocating 417MB for mer storage (15 bits wide).
 Allocating 28MB for bucket pointer table (28 bits wide).
 Allocating 32MB for counting the size of each bucket.
 Counting mers in buckets:  186.92 Mmers -- 19.14 Mmers/second
 Creating bucket pointers.
 Releasing 32MB from counting the size of each bucket.
 Filling mers into list:    186.92 Mmers --  5.37 Mmers/second
 Writing output:            186.92 Mmers -- 19.06 Mmers/second
Segment 0 finished.

% meryl -Dh -s SRR133640 > SRR133640.histogram
Found 186922193 mers.
Found 25181862 distinct mers.
Found 19096862 unique mers.
Largest mercount is 4312; 0 mers are too big for histogram.

The histogram shows the expected shape, with a coverage peak at about 33x and a large number of error k-mers: 19096862 of the 25181862 distinct mers (about 76%) occur only once.

K-mer histogram
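
The coverage peak can also be read from the histogram file without plotting it. The sketch below is one simple way to do that, not the method merTrim itself uses: it walks down the initial error-k-mer slope until the counts start rising again, then reports the multiplicity with the most distinct k-mers after that point. It assumes the first two whitespace-separated columns of the histogram are the k-mer count and the number of distinct k-mers with that count; the function name kmer_peak is made up for this example.

```shell
# Report the k-mer multiplicity at the coverage peak of a histogram on stdin.
# Assumption: column 1 = multiplicity, column 2 = number of distinct k-mers.
kmer_peak() {
    awk '
        # The first rise after the error slope marks the valley.
        !valley && NR > 1 && $2 > prev { valley = 1 }
        # Past the valley, track the tallest bin.
        valley && $2 > best { best = $2; peak = $1 }
        { prev = $2 }
        END { print peak }
    '
}

# Example use (hypothetical):
#   kmer_peak < SRR133640.histogram
```

A noisy histogram could trigger the valley test too early, so treat the result as a quick check rather than a replacement for looking at the plot.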

Correction is done once per file, then we create a new FRG format wrapper for the corrected reads.

% merTrim -v -t 8 -m 19 -mc SRR133640 -mCillumina -F SRR133640_1.fastq -o SRR133640_corrected_1.fastq
Guessed X coverage is 34
Use minCorrect=11 minVerified=8
creating adapter mer database.
loading genome mer database from meryl 'SRR133640'.
 17229.6/s -      603 queued for compute;  1155296 finished;      115 queued for output)
Success!  Bye.

% merTrim -v -t 8 -m 19 -mc SRR133640 -mCillumina -F SRR133640_2.fastq -o SRR133640_corrected_2.fastq
Guessed X coverage is 34
Use minCorrect=11 minVerified=8
creating adapter mer database.
loading genome mer database from meryl 'SRR133640'.
 17160.9/s -      660 queued for compute;  1155239 finished;      128 queued for output)
Success!  Bye.

% fastqToCA \
    -libraryname SRR133640 \
    -insertsize 600 60 \
    -technology illumina \
    -type sanger \
    -mates SRR133640_corrected_1.fastq,SRR133640_corrected_2.fastq \
  > SRR133640.corrected.frg
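
To see how much the correction step changed the reads, the record count and mean read length of the raw and corrected files can be compared. This is a small sketch that is not part of the original workflow; it assumes 4-line FASTQ records and only shows a length change if merTrim writes the trimmed sequence itself. The function name fastq_stats is made up for this example.

```shell
# Print "<reads> <mean length>" for a FASTQ stream on stdin.
# Assumption: every record is exactly 4 lines, sequence on line 2 of each.
fastq_stats() {
    awk 'NR % 4 == 2 { n++; bases += length($0) }
         END {
             if (n) printf "%d %.1f\n", n, bases / n
             else   print "0 0.0"
         }'
}

# Example use (hypothetical):
#   fastq_stats < SRR133640_1.fastq
#   fastq_stats < SRR133640_corrected_1.fastq
```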

cd ..

Assemble

We'll assemble both the raw uncorrected reads and the merTrim corrected reads. The uncorrected reads are first trimmed with Overlap Based Trimming, while the corrected reads are assembled as is.

SGE is enabled, but no special settings are needed.

runCA -p ypestis -d ypestis-raw       useGrid=1 scriptOnGrid=1 doOBT=1 unitigger=bogart READS-sra/SRR133640.frg
runCA -p ypestis -d ypestis-corrected useGrid=1 scriptOnGrid=1 doOBT=0 unitigger=bogart READS-sra/SRR133640.corrected.frg

Results

Neither assembly is great, especially compared to the assembly with 454 reads.

QC statistics comparison

With corrected reads, the assembly is much cleaner (fewer small scaffolds, small contigs, degenerate contigs, singletons), has more good mate pairs incorporated, and more reads assembled into contigs. None of this improved the assembly structure, though: scaffold N50 is essentially unchanged (51558 vs. 51552) and contig N50 improves only slightly (26453 vs. 25632).

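Results are generally comparable to the previous version. In the statistics below, the first value on each row is from the corrected-read assembly and the second is from the raw-read assembly.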

mergeqc.pl -wiki ypestis-*/*qc > qc.wiki
Files
ypestis-corrected/ypestis.qc ypestis-raw/ypestis.qc
Scaffolds
TotalScaffolds 144 146
TotalContigsInScaffolds 288 315
MeanContigsPerScaffold 2.00 2.16
MinContigsPerScaffold 1 1
MaxContigsPerScaffold 9 9
TotalBasesInScaffolds 4488105 4476867
MeanBasesInScaffolds 31167 30663
MinBasesInScaffolds 1655 193
MaxBasesInScaffolds 120993 120082
N25ScaffoldBases 73775 71451
N50ScaffoldBases 51558 51552
N75ScaffoldBases 30400 34140
ScaffoldAt1000000 78707 74740
ScaffoldAt2000000 58683 56764
ScaffoldAt3000000 37107 39292
ScaffoldAt4000000 17946 18052
TotalSpanOfScaffolds 4485731 4480029
MeanSpanOfScaffolds 31151 30685
MinScaffoldSpan 1655 193
MaxScaffoldSpan 120930 120053
IntraScaffoldGaps 144 169
2KbScaffolds 143 141
2KbScaffoldSpan 4484076 4475153
MeanSequenceGapLength -16 19
Top5Scaffolds=contigs,size,span,avgContig,avgGap,EUID
0 6,120993,120930,20166,-13,120002257139 6,120082,120053,20014,-6,120002326318
1 5,115726,115652,23145,-18,120002257175 4,111978,111918,27994,-20,120002326348
2 3,111924,111884,37308,-20,120002257114 6,103338,103238,17223,-20,120002326315
3 6,103551,103493,17258,-12,120002257248 7,102219,102178,14603,-7,120002326346
4 5,103365,103285,20673,-20,120002257123 4,97479,97419,24370,-20,120002326290
total 25,555559,555244,22222,-16 27,535096,534806,19818,-13
Contigs
TotalContigsInScaffolds 288 315
TotalBasesInScaffolds 4488105 4476867
TotalVarRecords 1827 13582
MeanContigLength 15584 14212
MinContigLength 67 75
MaxContigLength 68862 67393
N25ContigBases 42689 42614
N50ContigBases 26453 25632
N75ContigBases 13516 12762
ContigAt1000000 46440 44521
ContigAt2000000 28886 28132
ContigAt3000000 18835 17483
ContigAt4000000 8365 7392
BigContigs_greater_10000
TotalBigContigs 147 139
BigContigLength 3805190 3630141
MeanBigContigLength 25886 26116
MinBigContig 10057 10061
MaxBigContig 68862 67393
BigContigsPercentBases 84.78 81.09
SmallContigs
TotalSmallContigs 141 176
SmallContigLength 682915 846726
MeanSmallContigLength 4843 4811
MinSmallContig 67 75
MaxSmallContig 9983 9975
SmallContigsPercentBases 15.22 18.91
DegenContigs
TotalDegenContigs 703 1055
DegenContigLength 173942 211164
DegenVarRecords 321 1013
MeanDegenContigLength 247 200
MinDegenContig 65 64
MaxDegenContig 14409 14394
DegenPercentBases 3.88 4.72
Top5Contigs=reads,bases,EUID
0 31984,68862,120002256119 29849,67393,120002326050
1 31311,66961,120002256964 29423,66682,120002326184
2 29872,63712,120002257089 30136,63845,120002325973
3 28658,61840,120002256928 28425,63803,120002326103
4 28626,61745,120002256920 27185,62123,120002326057
total 150451,323120 145018,323846
UniqueUnitigs
TotalUUnitigs 3356 10963
MinUUnitigLength 64 64
MaxUUnitigLength 68865 35388
MeanUUnitigLength 1388 484
SDUUnitigLength 5199 2017
Surrogates
TotalSurrogates 667 1077
SurrogateInstances 3199 3390
SurrogateLength 102194 138997
SurrogateInstanceLength 474288 495470
UnPlacedSurrReadLen 6236681 6191202
PlacedSurrReadLen 1043322 800936
MinSurrogateLength 77 64
MaxSurrogateLength 5450 4850
MeanSurrogateLength 153 129
SDSurrogateLength 269 206
Mates
ReadsWithNoMate 53710(2.39%) 122954(5.73%)
ReadsWithGoodMate 2008062(89.19%) 1838868(85.74%)
ReadsWithBadShortMate 0(0.00%) 0(0.00%)
ReadsWithBadLongMate 404(0.02%) 184(0.01%)
ReadsWithSameOrientMate 194(0.01%) 118(0.01%)
ReadsWithOuttieMate 248(0.01%) 196(0.01%)
ReadsWithBothChaffMate 1562(0.07%) 1340(0.06%)
ReadsWithChaffMate 13946(0.62%) 22134(1.03%)
ReadsWithBothDegenMate 104160(4.63%) 85844(4.00%)
ReadsWithDegenMate 14376(0.64%) 14062(0.66%)
ReadsWithBothSurrMate 31718(1.41%) 39812(1.86%)
ReadsWithSurrogateMate 8902(0.40%) 6620(0.31%)
ReadsWithDiffScafMate 14110(0.63%) 12670(0.59%)
ReadsWithUnassignedMate 0(0.00%) 0(0.00%)
TotalScaffoldLinks 5 0
MeanScaffoldLinkWeight 13.00 0.00
Reads
TotalReadsInput NA NA
TotalUsableReads 2251392 2144802
AvgClearRange 98 96
ContigReads 2076829(92.25%) 1972845(91.98%)
BigContigReads 1784966(79.28%) 1624081(75.72%)
SmallContigReads 291863(12.96%) 348764(16.26%)
DegenContigReads 102310(4.54%) 93285(4.35%)
SurrogateReads 74288(3.30%) 73169(3.41%)
PlacedSurrogateReads 10703(0.48%) 8606(0.40%)
SingletonReads 8668(0.39%) 14109(0.66%)
ChaffReads 8653(0.38%) 14105(0.66%)
Coverage
ContigsOnly 45.46 42.15
Contigs_Surrogates 46.85 43.54
Contigs_Degens_Surrogates 47.26 43.49
AllReads 49.29 45.84
TotalBaseCounts
BasesCount NA NA
ClearRangeLengthFRG NA NA
ClearRangeLengthASM 221202093 205210547
SurrogateBaseLength 7280003 6992138
ContigBaseLength 204044981 188715188
DegenBaseLength 10061109 8983552
SingletonBaseLength 859322 1320605
Contig_SurrBaseLength 210281662 194906390
gcContent
Content 47.48 47.40
Unitig Consensus
NumColumnsInUnitigs 17376479 20940564
NumGapsInUnitigs 1616 4844
NumRunsOfGapsInUnitigReads 211615 926577
Contig Consensus
NumColumnsInUnitigs 4666245 4698343
NumGapsInUnitigs 4206 10178
NumRunsOfGapsInUnitigReads 242185 389054
NumColumnsInContigs 4666094 4698199
NumGapsInContigs 4050 10033
NumRunsOfGapsInContigReads 233367 378799
NumAAMismatches 2692 21370
NumVARStringsWithFlankingGaps 1137 1298
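
The two-column rows above can be compared mechanically. The sketch below assumes the rows have been saved as whitespace-separated text exactly as shown (name, corrected value, raw value) in a hypothetical file qc.txt; it is not a feature of mergeqc.pl. It prints the difference (corrected minus raw) for every row where both values are plain numbers and skips the rest, such as the percentage, NA, and Top5 rows. The function name qc_delta is made up for this example.

```shell
# For rows of the form "<name> <corrected> <raw>" on stdin, print the row
# followed by corrected - raw. Rows with non-numeric values are skipped.
qc_delta() {
    awk 'NF == 3 && $2 ~ /^-?[0-9.]+$/ && $3 ~ /^-?[0-9.]+$/ {
             printf "%s %s %s %+.2f\n", $1, $2, $3, $2 - $3
         }'
}

# Example use (hypothetical):
#   qc_delta < qc.txt
```

Section names that repeat a row label (for example TotalContigsInScaffolds under both Scaffolds and Contigs) are printed once per occurrence, so the output keeps the order of the table.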