Pair classification within Illumina mate pair data


This poster was presented at Cold Spring Harbor Genome Informatics, November 2-5, 2011.

DeNovo Classification Poster.


Brian Walenz, Granger Sutton, Jason Miller
The J. Craig Venter Institute
Rockville MD 20850 USA


Paired reads can provide long-range connectivity for the de novo genome assembly process. The Illumina platforms for DNA sequencing offer two protocols for generating paired reads. The paired end protocol (PE) delivers pairs with genome separations in the range of 300 bp, while the mate pair protocol (MP) delivers pairs with genome separations of 3 Kbp and higher. Illumina MP sequencing actually generates three pair types, described here as (1) true mate pairs, (2) unintentional paired ends, and (3) junction pairs where one read is chimeric with respect to the genome. Pair type is neither captured nor represented in the sequencing data, which compromises the utility of MP data. We have developed a bioinformatics pipeline called DNC for de novo classification of Illumina mate pairs by pair type. DNC classifies without a reference, using other reads from the same genome. DNC computes pair-wise alignments between the reads and then constructs an overlap graph. It searches the graph for paths between the two reads of each pair. Finally, it classifies pairs based on path characteristics including length. DNC can construct an overlap graph from any mixture of Illumina PE reads, 454 reads, Sanger reads, or even the Illumina MP reads being classified. DNC does not classify pairs that span a coverage gap or a very-high-copy repeat. DNC should be helpful to assemblers that rely on mate pairs as confirmation of contigs built from read sequences.


The Illumina mate pair sequencing protocol presents a mixture of three pair types:

  1. Outie long-range “mate pairs”
  2. Innie short-range “paired ends”
  3. “Junction pairs” with one chimeric read

Motivation and Method

The Celera Assembler1 software tests preliminary assemblies for concordance with the mate pair constraints. This test can be confounded when it encounters mate pairs and paired ends in the same library. To increase accuracy of this test, we developed a pre-processor that partitions each Illumina mate pair library into three libraries by type.

The de novo classifier (DNC) is new software that classifies every pair by constructing an overlap graph from all-vs-all read alignments and searching the graph for paths that connect the two reads of each pair. Genomic regions with low coverage could induce false negatives, because no path exists across a coverage gap. Genomic repeats could induce false paths, increasing search time. The long-running mate pair test is not required if the other tests are accurate.
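
The graph search described above can be sketched in Python as follows. This is a minimal illustration, not DNC's implementation: it assumes error-free reads that are all given on the same strand, substitutes exact suffix-prefix matches for real alignments, and makes no attempt to detect junction pairs. The function names and the parameters `min_len`, `max_span`, and `short_max` are hypothetical.

```python
import heapq


def suffix_prefix_overlap(a, b, min_len):
    """Length of the longest suffix of a equal to a prefix of b, or 0 if shorter than min_len."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-n:] == b[:n]:
            return n
    return 0


def build_overlap_graph(reads, min_len):
    """All-vs-all overlaps. An edge i -> j carries the offset of j's start past i's start."""
    graph = {i: [] for i in range(len(reads))}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                n = suffix_prefix_overlap(a, b, min_len)
                if n:
                    graph[i].append((j, len(a) - n))
    return graph


def path_span(graph, reads, src, dst, max_span):
    """Shortest span in bases from the start of read src to the end of read dst.

    Uses Dijkstra's algorithm over start offsets and abandons any path whose
    span would exceed max_span. Returns None when no path is found.
    """
    best = {src: 0}
    heap = [(0, src)]
    while heap:
        offset, node = heapq.heappop(heap)
        if offset > best.get(node, float("inf")):
            continue  # stale heap entry
        if node == dst:
            return offset + len(reads[dst])
        for nxt, step in graph[node]:
            new_offset = offset + step
            if new_offset + len(reads[nxt]) > max_span:
                continue  # bound the search, e.g. to limit time spent in repeats
            if new_offset < best.get(nxt, float("inf")):
                best[nxt] = new_offset
                heapq.heappush(heap, (new_offset, nxt))
    return None


def classify_pair(span, short_max):
    """Label a pair by path length alone; short_max is an illustrative threshold."""
    if span is None:
        return "unclassified"
    return "short-range" if span <= short_max else "long-range"
```

In this sketch, a pair whose reads lie on opposite sides of a coverage gap has no connecting path and stays unclassified, which mirrors the false-negative case noted above.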


DNC was challenged to classify simulated read pairs drawn from a real genome. In each test, DNC built its overlap graph from a different set of simulated unpaired reads. These sets varied by read length and genome coverage.
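
As a rough illustration of how read length and coverage determine such a read set, the sketch below samples error-free, forward-strand reads uniformly from a genome string. The number of reads follows from coverage = (number of reads x read length) / genome length. The function name, its parameters, and the uniform, error-free sampling are assumptions made for this example; the poster does not describe how its simulated reads were produced.

```python
import random


def simulate_unpaired_reads(genome, read_len, coverage, seed=0):
    """Sample unpaired reads so that total read bases ~= coverage * genome length."""
    rng = random.Random(seed)
    n_reads = round(coverage * len(genome) / read_len)
    last_start = len(genome) - read_len  # latest position where a full read still fits
    starts = (rng.randint(0, last_start) for _ in range(n_reads))
    return [genome[s:s + read_len] for s in starts]
```

For example, a 1,000 bp genome sampled at 10x coverage with 100 bp reads yields 100 reads, while the same coverage with 50 bp reads yields 200.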


The de novo classifier (DNC) is new software that will be bundled with the Celera Assembler starting with version 7. The Celera Assembler is supported by open source and user group meetings. Please visit [1]