Preprocessing Recipe 4

From wgs-assembler

We developed a method for trimming, correcting and generally fixing up reads from Illumina, 454 and Sanger sequencers. This process became known as 'trim 4', possibly because it was the fourth attempt at trimming the reads. It was never officially written up, scripted, or really even used 'in production'. During archeology to reclaim disk space, BPW discovered two versions of this process. Not wanting to delete them, or to spend significant effort pipelining and documenting them, we are dumping them here in more or less running order.

In addition to the usual erroneous base calls, these reads have several specific problems:

Sanger reads:

  • Low quality ends: a few bases on the 5' end, and many bases on the 3' end.

454 reads:

  • Homopolymer run errors.
  • Mate pairs are joined together in one read, with linker sequence in the middle.
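
To make the joined-mate problem concrete, here is a minimal Python sketch that splits one read into two halves around a linker. It is an illustration only: the function name `split_on_linker` and the linker string in the usage example are hypothetical, and it uses exact string matching, whereas a real tool has to tolerate sequencing errors inside the linker and decide how to orient the two halves.

```python
def split_on_linker(read, linker):
    """Split a read into (left, right) around the first exact linker match.

    Returns None when the linker is not found, so the caller can treat
    the read as an unmated fragment. Exact matching is an assumption
    made to keep the sketch short.
    """
    pos = read.find(linker)
    if pos < 0:
        return None
    left = read[:pos]
    right = read[pos + len(linker):]
    return left, right


# Hypothetical linker, NOT a real 454 linker sequence.
EXAMPLE_LINKER = "ACGTTGCA"
halves = split_on_linker("AAAA" + EXAMPLE_LINKER + "CCCC", EXAMPLE_LINKER)
```

Here `halves` is the pair `("AAAA", "CCCC")`; a read without the linker yields `None`.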

Illumina reads:

  • Inconsistent quality score encoding.
  • Paired end reads can be overlapping.
  • Mate pair (jumping) reads are in the 'outtie' orientation.
  • Mate pair (jumping) reads are contaminated with short-insert paired-end reads.
  • Mate pair (jumping) reads are contaminated with chimeric reads that spanned the circularization junction.
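
The quality encoding problem can be sketched in a few lines of Python. This is a common heuristic, not what fastqAnalyze does internally: the function names `guess_offset` and `to_sanger` and the cutoffs 59 and 74 are assumptions for the example. Characters below `;` (ASCII 59) only occur in offset-33 data, while characters above `J` (ASCII 74) suggest offset 64. Very old Solexa scores are on a different scale, so a plain offset shift is only approximate for them.

```python
def guess_offset(quality_strings):
    """Guess the ASCII offset (33 or 64) from the observed character range.

    Returns None when the input is empty or the range is ambiguous.
    """
    codes = [ord(c) for q in quality_strings for c in q]
    if not codes:
        return None
    if min(codes) < 59:   # below ';' is only possible with offset 33
        return 33
    if max(codes) > 74:   # above 'J' suggests offset 64
        return 64
    return None           # cannot tell from this sample


def to_sanger(qual, offset):
    """Re-encode one quality string to offset 33 (Sanger).

    Negative values (possible in Solexa data) are clamped to 0, which is
    a simplification.
    """
    if offset == 33:
        return qual
    return "".join(chr(max(ord(c) - offset, 0) + 33) for c in qual)
```

For example, `to_sanger("hhhh", 64)` gives `"IIII"`, since `h` is quality 40 at offset 64 and `I` is quality 40 at offset 33.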


Both versions followed the same basic steps (with the Celera Assembler components involved listed in parentheses):

  • collect links to raw reads
  • convert QVs to Sanger format (fastqAnalyze)
  • convert any 454 SFFs to fastq (sffToCA)
  • merge any overlapping PE reads (FLASH)
  • correct errors, trim off adapter (merTrim)
  • remove duplicates (runCA-dedupe)
  • map to a known reference, can filter PE from this
  • separate PE from MP (classifyMates)
  • downsample (fastqSample)
  • collect final reads, build .frg format (fastqToCA)
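
As an illustration of the downsampling step only, here is a minimal Python sketch of reservoir sampling over read pairs. It is not how fastqSample is implemented; the function name `sample_pairs`, the seed, and the in-memory representation of a pair are all assumptions for the example. What it shows is that mates are kept or dropped together, so pairing survives the downsample.

```python
import random


def sample_pairs(pairs, n, seed=1):
    """Keep a uniform random sample of at most n read pairs.

    `pairs` is any iterable of (read1, read2) items; each item is kept
    or dropped as a unit. Uses reservoir sampling, so the input is read
    once and only n pairs are held in memory.
    """
    rng = random.Random(seed)  # fixed seed makes the sample repeatable
    kept = []
    for i, pair in enumerate(pairs):
        if i < n:
            kept.append(pair)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < n:
                kept[j] = pair
    return kept
```

Calling it twice with the same seed returns the same pairs, which is convenient when an assembly needs to be rerun on the same subset.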


I was intending to put up the scripts one by one, as below, but now I don't quite feel like making 45 pages. So have a tarball instead.

Version 1

From a 3 Gbp genome, with gobs of Illumina reads spanning almost all instrument generations (fortunately, not the first generation!) and library preparations, both PE and MP, plus some (obviously) old Sanger plasmid, fosmid and BAC ends. If I recall correctly, we did, at least, try the merTrim process on the Sanger reads, but care needs to be taken so that Illumina biases don't destroy the reads.


Version 2

From a 40 Mbp genome, with somewhere around 1000x coverage in both Illumina PE and Illumina MP, plus some 454 8 Kbp mate pairs.
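
To give a sense of scale for that coverage, here is a small Python sketch of the arithmetic. The genome size and the 1000x figure come from the description above; the 100 bp read length and the 50x downsampling target are hypothetical values chosen only for the example, as are the function names.

```python
def total_bases(genome_bp, coverage):
    """Total sequenced bases implied by a coverage depth."""
    return genome_bp * coverage


def reads_needed(genome_bp, coverage, read_len):
    """Number of reads of length read_len needed for the coverage (rounded up)."""
    return (genome_bp * coverage + read_len - 1) // read_len


def keep_fraction(current_coverage, target_coverage):
    """Fraction of reads to keep when downsampling to target_coverage."""
    return min(1.0, target_coverage / current_coverage)


GENOME_BP = 40_000_000   # 40 Mbp, from the text
COVERAGE = 1000          # roughly 1000x, from the text
READ_LEN = 100           # hypothetical read length
TARGET = 50              # hypothetical downsampling target

bases = total_bases(GENOME_BP, COVERAGE)            # 40 Gbp of sequence
reads = reads_needed(GENOME_BP, COVERAGE, READ_LEN)
fraction = keep_fraction(COVERAGE, TARGET)
```

Under these assumptions, 1000x of a 40 Mbp genome is 40 Gbp of sequence, or 400 million 100 bp reads, and reaching 50x means keeping about 5% of them.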