PBcR

From wgs-assembler
Jump to: navigation, search

If you are assembling PacBio or Oxford Nanopore data, please use Canu instead. Celera Assembler is no longer being maintained.

Assembling Large Genomes with Single-Molecule Sequencing and Locality Sensitive Hashing. Nature Biotechnology. (2015).
Konstantin Berlin*, Sergey Koren*, Jason Chin, James P. Drake Jane M. Landolin Adam M. Phillippy,

  • The ‘MHAP’ paper demonstrates that probabilistic alignment methods are well suited to address the read length and error rate of single-molecule and can significantly improve on the runtimes of existing approaches. Please cite for projects using non-hybrid CA v8.2 on.


Reducing assembly complexity of microbial genomes with single-molecule sequencing. Genome Biology 14:R101 (2013).

Sergey Koren, Gregory P Harhay, Timothy P. Smith, James L. Bono, Dayna M Harhay, D Scott McVey, Diana Radune, Nicholas H. Begman, Adam M. Phillippy,

  • The ‘assembly complexity’ paper demonstrates that a majority of microbial genomes can be fully resolved to finished-grade accuracy using PacBio SMRT sequencing based on an analysis of all genomes in Genbank as of 2012. An updated version of PBcR for non-hybrid hierarchical assembly is also presented. Please cite for projects using non-hybrid PBcR with CA prior to v8.2.


Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nature Biotech. (2012).

Sergey Koren, Michael Schatz, Brian Walenz, Jeffrey Martin, Jason Howard, Ganeshkumar Ganapathy, Zhong Wang, David A. Rasko, W. Richard McCombie, Erich D. Jarvis, and Adam M. Phillippy,

  • The ‘PBcR’ paper presents the first hierarchical assembly method (overlap, correct, assemble) and demonstrates that long, noisy reads can be used for assembly after correction. The hierarchical approach is now used by many single-molecule assembly tools, including HGAP, Falcon, and others. Please cite for projects utilizing hybrid PBcR correction or assembling long-read (>2Kbp) reads with Celera Assembler.

About PacBio RS

The PacBio RS sequencing instrument from Pacific Biosciences is a single-molecule sequencer that is able to generate long-reads (CLR). Unfortunately, these sequences have a low accuracy, averaging only 85-88%. While It is possible to read each circularized molecule multiple times to get higher accuracy sequences (CCS), this produces significantly shorter sequences.

About Oxford Nanopore MinION

The Oxford Nanopore MinION sequencing instrument from Oxford Nanopore Technologies is a single-molecule sequencer. As it is based on nanopore sensing and requires no optics, it is a miniaturized device, approximately the size of a thumb-drive. The most recent chemistry (7.3) and basecaller (Metrichor >1.9) can produce 2D (sequence read in both the template and reverse strand) data with 78-85% accuracy (Loman et. al. 2015). At this error rate, PBcR can work out of the box with nanopore data. .

About (PBcR)

The PBcR pipeline enables the use of the high-noise long-read sequences produced by a single-molecule instrument. It was the first hybrid assembly pipeline. The hybrid approach works by first improving the accuracy of the single-molecule sequences and then assembling the improved sequences. To algorithmically deal with the error, we support solely high-noise data from recent chemistries (C2 for PacBio or 7.3 for Oxford) or high-identity sequences (454, Illumina, or PacBio circularized sequences) to re-call accurate consensus.

The pipeline outputs fastq, fasta, and qual formats as well as a Celera Assembler FRG file and will automatically assemble the data.

If you are having trouble, please check the known issues and known Celera Assembler issues to see if your problem is documented. If you are working with large genomes (>100Mbp), check the recommended parameters section for instructions.

Announcements

  • May 24, 2015 - PBcR and CA 8.3rc2 as source or pre-compiled for Linux is now available. CA 8.3rc2 includes support for Oxford Nanopore parameters as well as the PBS/Torque cluster system. Computing overlaps for 70X of P6 human data now takes < 8K cpu hours on an NFS file system. Performance on low-coverage (<30X) parameters is improved with adjusted parameters, requiring only 20X P5 data to close an E. coli genome. See the release notes for details.
  • March 25, 2015 - Added experimental suggested parameters for very low-coverage assembly (30X or less).
  • February 23, 2015 - PBcR and CA 8.3rc1 as source or pre-compiled for Linux is now available. CA 8.3rc1 includes an updated version of MHAP which is approximately 4-fold faster. Computing overlaps for 54X of human data now takes < 20K cpu hours on an NFS file system instead of 80K. Performance on low-coverage datasets is also significantly improved, requiring only 30X P5 data to close an E. coli genome. See the release notes for an example on S. cerevisae.
  • July 19, 2014 - PBcR and CA 8.2 as source or pre-compiled for Linux is now available. PBcR now incorporates a novel probabilistic overlapper for self-correction of sequences named MHAP. This allows assembly of prokaryotic genomes in < 30 minutes on a typical desktop and assembly of small eukaryotic genomes in < 2 days. For best results, java 1.8r25 or newer is recommended to use MHAP.
  • January 13, 2014 - PBcR followed by assembly using CA 8.1 is used to assemble D. melanogaster using only PacBio RS II sequencing reads. The CA assembly places entire chromosome arms de novo into fewer pieces than the current version of the reference genome (Release 5). The largest contig comprises the majority of chromosome arm 3L. All autosomal arms (2L, 2R, 3R, and 4) are in less than six large contigs. More details, raw and corrected data, as well as the assembly are available at the PacBio Blog and CBCB.
  • December 17, 2013 - CA 8.1 as source or pre-compiled for Linux is available. CA 8.1 supports self-correction of PacBio as well as correction of PacBio sequences using complementary high-identity sequencing (such as Illumina, 454, or CCS).
  • April 4, 2013 - Updated documentation to describe recommended Celera Assembler parameters and Quiver polishing of consensus for assembly of corrected sequences. Please see assembly of corrected sequences for more information.
  • March 19, 2013 - Updated documentation to describe new option of self-correction. PBcR can correct PacBio RS C2 data (or newer) using solely long-read low-accuracy data. Please see self-correction for more information.
  • March 18, 2013 - Updated documentation to describe new support for both BLASR or Bowtie 2 for alignment (in addition to CA's overlapper).
  • August 9, 2012 - An NGS Leaders webinar by Adam Phillippy and Sergey Koren is now available at YouTube.
  • July 1, 2012 - The paper on correction and assembly is now available as an advance online publication from Nature Biotechnology.
  • June 20, 2012 - New known issue: using Illumina sequences shorter than 100bp for correction requires adjusted parameters. Please see the known issues for the required parameters.
  • May 4, 2012 - New known issue: PacBio SMRT-portal 1.3.0 outputs fastq files in a non-standard format. If you encounter errors running the pipeline or have no corrected sequences output, please see the known issues.

Usage

The PBcR pipeline is available in CA 7.0 or newer, however CA 8.3 is recommended. The self-correction feature, requires at least CA 8.2 but Celera Assembler 8.3 is recommended. The instructions below assume you are using the CA 8.3 release. If you are using an older version of CA (8.1 or earlier), please see the legacy instructions.

usage:PBcR [options] -s spec.file fastqFile=fastqfile [frg]
  -length        Minimum length of PacBio RS fragment to keep.
  -partitions    Number of partitions for consensus.
  -l libraryname Name of the library; freeformat text.
  -t threads     Number of threads to use for correction.
  -noclean       Do not clean up after the correction finishes. By default, the temp<libraryname> directory will be removed on successful completion of the correction.
 

The pipeline will corrected the sequences and output a CA FRG file consisting of a single LIB message to <libraryname>.frg. It will also output the corrected fasta, qual, and fastq sequences (named <libraryname>.fasta, <libraryname>.qual, and <libraryname>.fastq). Finally, it will assemble the data with default parameters.

In addition, the pipeline will also output <libraryname>.log file which indicates the source for each corrected sequence. The format of the file is:

% head <libraryname>.log
INPUT_NAME	                                                                OUTPUT_NAME	  SUBREAD	 START      END	       LENGTH
m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/8962/3861_17595	pacbio_3307_1	    1	          53	   13682	13629
m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/8965/6675_14278	pacbio_3309_1	    1	          51	    7551	 7500
m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/8977/0_17507	pacbio_3316_1	    1	          63	   17456	17393

The first column is the input name of the PacBio RS sequencing read which was corrected. The second column is the name of the output corrected sequence. The third column indicates if the input PacBio RS sequence was split into more than a single fragment. This can happen due to SMRTbell adapters or issues correcting a subset of the long sequence. The next columns indicate the positions in the original PacBio RS sequence where the corrected read came from. The final column is the length of the corrected sequence. However, the log file and positions are estimates because they are a pre-consensus view. The consensus module may trim/split/discard sequences if there is insufficient coverage to achieve a minimum QV for the corrected sequence.

The <libraryname>.correction.* and <libraryname>.err files may also get created based on whether the pipeline runs on the grid or a single machine. These provide debugging output. If you encounter issues with the pipeline or have questions, please include these *.err files with your email.

The -libraryname option is mandatory. It provides a UID name for the library that the reads are placed into. This name is completely up to the end user, but must contain no white spaces or commas.

The -s pacbio.spec option is mandatory. It should point to a a CA spec file to specify how to overlap the illumina and PacBio data for correction (which comes included with the sample data.) On a shared-memory multi-core system, the spec file can be empty and the pipeline will auto-detect the available resources. On a grid environment (SGE/LFS/PBS) see more information in the grid configuration section.

The fastqFile option is mandatory. It specifies the pacbio fragments to correct. The correction will also work with unfiltered reads if you don't have a filtered fastq file.

The options (-length, -partitions, and -t, -noclean) are optional.

You can specify an arbitrary number of FRG files to be used as the trusted sequences for correction. If you have 20X+ coverage of C2 sequencing data (or newer) you can also perform self-correction. This option is automatically enabled if you omit the FRG files. A total of 50+X coverage is recommended but good assemblies can be generated from as low as 15X. Below is an example on S. cerevisiae comparing CA 8.2 and CA 8.3 performance at varying coverages. If you have less than 30X, see the recommended parameters below.

Contig NG50 (A) and NG10 (B) lengths are given for PBcR assemblies of S. cerevisiae at various sequencing coverage (10–120X) using BLASR, MHAP fast, MHAP sensitive in CA 8.2, and a beta version of CA 8.3. For this example, there is benefit from increasing overlap sensitivity at lower coverage. The MHAP fast parameters mirror the BLASR assembly, while the MHAP sensitive parameters outperform the other assemblies at lower coverage. CA 8.3 automatically switches to MHAP sensitive below 50X and fast above. Below 30X, CA 8.3's low coverage parameters are used, which must be manually enabled by the user. The dashed black line is an Illumina Miseq 2x300 @450bp assembly from Lee et. al 2014. The ECTools result is also from Lee et. al. using the Illumina MiSeq assembly for PacBio correction Lee et. al 2014.

Below 15X, a hybrid approach is likely to outperform self-correction. Input FRG files can be any format CA accepts. Please see the FastqToCA and SffToCA pages for information on inputting Illumina or 454 data. Generally, you don't need more than about 50X of high-identity sequences for correction. Note that this mode of PBcR is no longer being updated and is significantly slower than self-correction due to its reliance on BLASR (or a built-in module) for overlap computation.

Inputting Single-Molecule Sequences

The PBcR pipeline expects single-molecule sequences in fastq format (with sanger (PHRED32) quality values). If you already have fasta files, you can download a Java utility to convert the fasta files into PBcR compatible fastq files:

java -jar convertFastaAndQualToFastq.jar myfile.fasta > myfile.fastq

NOTE: If you are using the SMRT-portal version 1.3, please see the known issues if you encounter errors or no corrected sequences are output.

Self-Correction With C2/7.3 Sequences (or newer)

Starting with the C2 chemistry (PacBio, and the more recent chemistries), or 7.3 2D sequences (MinION), you can self-correct PacBio RS sequences without any high-identity technology. Celera Assembler 8.3 is recommended. Approximately 50X (or more) of sequencing is ideal. If you have less than 15X PacBio RS sequencing, then you can use hybrid correction with high-identity data. While PBcR supports high-identity data for correction, this mode is no longer being updated, is not supported for MinION data, and is significantly slower than self-correction due to its reliance on BLASR (or a built-in module) for overlap computation.

To run the pipeline, follow the usage instructions below.

Correction Using Complementary High-Identity Sequences

If you have less than 15X sequencing or older chemistry, then it is recommended to combine the PacBio data with a complementary high-identity sequencing dataset. These can be Illumina, 454, Ion Torrent, or PacBio CCS reads. Note that this mode of PBcR is no longer being updated and is significantly slower than self-correction due to its reliance on BLASR (or a built-in module) for overlap computation.

Inputting High-Identity Sequences

The PBcR pipeline accepts an arbitrary number of frg files at the end of the command.

  • For Illumina data, use the fastqToCA command. For examples, see the phage example below and the usage page.
  • For 454 data, use sffToCA.
  • For PacBio RS CCS sequences, use the fastqToCA command. For examples, see the usage page.

Tutorials

Assembling a Lambda Phage

This tutorial covers both self and hybrid correction. Download the sample data.

% tar xvzf sampleData.tar.gz
sampleData/
sampleData/illumina.fastq
sampleData/reference.fasta
sampleData/pacbio.filtered_subreads.fasta
sampleData/pacbio.spec
sampleData/pacbio.SGE.spec
sampleData/README
sampleData/convertFastaAndQualToFastq.jar

Prepare PacBio RS single-pass sequences

Next, run the correction to generate corrected pacbio sequences. This example uses the Java utility above to convert the PacBio RS fasta file to fastq.

% java -jar convertFastaAndQualToFastq.jar pacbio.filtered_subreads.fasta > pacbio.filtered_subreads.fastq

Correcting and Assembling PacBio RS sequences

Run the correction to self-correct and assemble the PacBio RS sequences.

% <wgs>/<Linux-amd64>/bin/PBcR PBcR -length 500 -partitions 200 -l lambda -s pacbio.spec -fastq pacbio.filtered_subreads.fastq genomeSize=50000

After the correction runs, you will have an assembly in lambda/9-terminator/asm.ctg.fasta. You can check the statistics.

 head -n 40 lambda/9-terminator/asm.qc
[Scaffolds]
TotalScaffolds=1
TotalContigsInScaffolds=1
MeanContigsPerScaffold=1.00
MinContigsPerScaffold=1
MaxContigsPerScaffold=1

TotalBasesInScaffolds=49698
MeanBasesInScaffolds=49698
MinBasesInScaffolds=49698
MaxBasesInScaffolds=49698
N25ScaffoldBases=49698
N50ScaffoldBases=49698
N75ScaffoldBases=49698

TotalSpanOfScaffolds=49698
MeanSpanOfScaffolds=49698
MinScaffoldSpan=49698
MaxScaffoldSpan=49698
IntraScaffoldGaps=0
2KbScaffolds=1
2KbScaffoldSpan=49698
MeanSequenceGapLength=0

[Top5Scaffolds=contigs,size,span,avgContig,avgGap,EUID]
0=1,49698,49698,49698,0,7180000000002
total=1,49698,49698,49698,0

[Contigs]
TotalContigsInScaffolds=1
TotalBasesInScaffolds=49698
TotalVarRecords=0
MeanContigLength=49698
MinContigLength=49698
MaxContigLength=49698
N25ContigBases=49698
N50ContigBases=49698
N75ContigBases=49698

[BigContigs_greater_10000]

In this case, the phage was assembled into a single contig.

Correction using complementary data

First, convert illumina.fastq to CA frg format:

% cd sampleData/
% <wgs>/<Linux-amd64>/bin/fastqToCA -libraryname illumina -technology illumina -type sanger -innie -reads illumina.fastq > illumina.frg
% <wgs>/<Linux-amd64>/bin/PBcR -length 500 -partitions 200 -l lambdaIll -s pacbio.spec -fastq pacbio.filtered_subreads.fastq genomeSize=50000 illumina.frg

Note that the illumina.frg file was added the command line. The pipeline will correct and assemble the PacBio RS sequences. When it is finished, you will have an assembled lambda in lambdaIll/9-terminator/asm.ctg.fasta. You can check the assembly statistics

head -n 40 lambdaIll/9-terminator/asm.qc
[Scaffolds]
TotalScaffolds=1
TotalContigsInScaffolds=1
MeanContigsPerScaffold=1.00
MinContigsPerScaffold=1
MaxContigsPerScaffold=1

TotalBasesInScaffolds=48112
MeanBasesInScaffolds=48112
MinBasesInScaffolds=48112
MaxBasesInScaffolds=48112
N25ScaffoldBases=48112
N50ScaffoldBases=48112
N75ScaffoldBases=48112

TotalSpanOfScaffolds=48112
MeanSpanOfScaffolds=48112
MinScaffoldSpan=48112
MaxScaffoldSpan=48112
IntraScaffoldGaps=0
2KbScaffolds=1
2KbScaffoldSpan=48112
MeanSequenceGapLength=0

[Top5Scaffolds=contigs,size,span,avgContig,avgGap,EUID]
0=1,48112,48112,48112,0,7180000000002
total=1,48112,48112,48112,0

[Contigs]
TotalContigsInScaffolds=1
TotalBasesInScaffolds=48112
TotalVarRecords=0
MeanContigLength=48112
MinContigLength=48112
MaxContigLength=48112
N25ContigBases=48112
N50ContigBases=48112
N75ContigBases=48112

[BigContigs_greater_10000]

Using either self or hybrid correction, the phage can be accurately assembled into a single contig.

Assembling an E. coli

This example demonstrates the pipeline on a bacterial genome. Download the a random 30X subset of P5-C3 chemistry sequences (after filtering with SMRTportal). The full SMRTcell is also available. At this coverage, PBcR will automatically enable it's sensitive mode which uses adjusted MHAP parameters and PBDAGCON (which is included in the Linux/source distributions).

% tar xvzf selfSampleData.tar.gz
selfSampleData/reference.fasta
selfSampleData/pacbio.spec
selfSampleData/pacbio_filtered.fastq
selfSampleData/README

Run the pipeline to generate an assembly

% <wgs>/<Linux-amd64>/bin/PBcR -length 500 -partitions 200 -l K12 -s pacbio.spec -fastq pacbio_filtered.fastq genomeSize=4650000

The pipeline will automatically correct and assemble the data. The correction and assembly should take <30 minutes on a 16-core system. The corrected sequences are in K12.fastq (the longest 25X subset is K12.longest25.fastq). We can see the assembled ecoli in in K12/9-terminator/asm.ctg.fasta and check the stats for the assembly:

head -n 40 K12/9-terminator/asm.qc
[Scaffolds]
TotalScaffolds=1
TotalContigsInScaffolds=1
MeanContigsPerScaffold=1.00
MinContigsPerScaffold=1
MaxContigsPerScaffold=1

TotalBasesInScaffolds=4635585
MeanBasesInScaffolds=4635585
MinBasesInScaffolds=4635585
MaxBasesInScaffolds=4635585
N25ScaffoldBases=4635585
N50ScaffoldBases=4635585
N75ScaffoldBases=4635585
ScaffoldAt1000000=4635585

TotalSpanOfScaffolds=4635585
MeanSpanOfScaffolds=4635585
MinScaffoldSpan=4635585
MaxScaffoldSpan=4635585
IntraScaffoldGaps=0
2KbScaffolds=1
2KbScaffoldSpan=4635585
MeanSequenceGapLength=0

[Top5Scaffolds=contigs,size,span,avgContig,avgGap,EUID]
0=1,4635585,4635585,4635585,0,7180000000002
total=1,4635585,4635585,4635585,0

[Contigs]
TotalContigsInScaffolds=1
TotalBasesInScaffolds=4635585
TotalVarRecords=0
MeanContigLength=4635585
MinContigLength=4635585
MaxContigLength=4635585
N25ContigBases=4635585
N50ContigBases=4635585
N75ContigBases=4635585
ContigAt1000000=4635585

If you have MUMmer available, you can run:

% dnadiff reference.fasta K12/9-terminator/asm.ctg.fasta
% head -n 20 out.report
                               [REF]                [QRY]
[Sequences]
TotalSeqs                          1                    1
AlignedSeqs               1(100.00%)           1(100.00%)
UnalignedSeqs               0(0.00%)             0(0.00%)

[Bases]
TotalBases                   4639560              4635585
AlignedBases         4633780(99.88%)     4635509(100.00%)
UnalignedBases           5780(0.12%)            72(0.00%)

[Alignments]
1-to-1                             9                    9
TotalLength                  4634882              4633701
AvgLength                  514986.89            514855.67
AvgIdentity                    99.96                99.96

After assembly, polishing with Quiver is recommended to get the final consensus sequence. The full input data (bax.h5 files) are required to perform this polishing. If you have downloaded it and have SMRTportal installed and in your path, you can run:

find . |grep bax.h5 > K12.fofn
pbalign --nproc 16 --forQuiver K12.fofn K12/9-terminator/asm.ctg.fasta K12.cmp.h5
quiver -j 16 K12.cmp.h5 -r K12/9-terminator/asm.ctg.fasta -o K12/9-terminator/asm.quiver.fasta -o K12/9-terminator/asm.quiver.gff

Assembling a MinION dataset

This example demonstrates the pipeline on recently released Oxford Nanopore MinION dataset from Loman, NJ, Quick J, and Simpson JT. A complete bacterial genome assembled de novo using only nanopore sequencing data. bioRxiv, 2015 Download the dataset of 30x of 7.3 chemistry 2D reads after conversion with poretools. The original data is also available (ERX708228-ERX708231).

% tar xvzf sampleDataOxford.tar.gz
sampleDataOxford/reference.fasta
sampleDataOxford/README
sampleDataOxford/oxford.spec
sampleDataOxford/all.2d.fastq

Run the pipeline to generate an assembly

% <wgs>/<Linux-amd64>/bin/PBcR -length 500 -partitions 200 -l K12_OX -s oxford.spec -fastq all.2d.fastq  genomeSize=4650000

The correction should take <5 minutes on a 16-core system. The assembly will take approximately 1.5hrs as it uses a higher overlapping error rate than the default for PacBio RS corrected sequences. In total, less than 16CPU hours are required for correction and assembly (<1 for MHAP). Afterwards, we can see the assembled ecoli K12_OX/9-terminator/asm.ctg.fasta. We can check the stats for the assembly:

% head -n 40 K12_OX/9-terminator/asm.qc
[Scaffolds]
TotalScaffolds=1
TotalContigsInScaffolds=1
MeanContigsPerScaffold=1.00
MinContigsPerScaffold=1
MaxContigsPerScaffold=1

TotalBasesInScaffolds=4329931
MeanBasesInScaffolds=4329931
MinBasesInScaffolds=4329931
MaxBasesInScaffolds=4329931
N25ScaffoldBases=4329931
N50ScaffoldBases=4329931
N75ScaffoldBases=4329931
ScaffoldAt1000000=4329931

TotalSpanOfScaffolds=4329931
MeanSpanOfScaffolds=4329931
MinScaffoldSpan=4329931
MaxScaffoldSpan=4329931
IntraScaffoldGaps=0
2KbScaffolds=1
2KbScaffoldSpan=4329931
MeanSequenceGapLength=0

[Top5Scaffolds=contigs,size,span,avgContig,avgGap,EUID]
0=1,4329931,4329931,4329931,0,7180000000002
total=1,4329931,4329931,4329931,0

[Contigs]
TotalContigsInScaffolds=1
TotalBasesInScaffolds=4329931
TotalVarRecords=0
MeanContigLength=4329931
MinContigLength=4329931
MaxContigLength=4329931
N25ContigBases=4329931
N50ContigBases=4329931
N75ContigBases=4329931
ContigAt1000000=4329931

If you have MUMmer available you can run:

% cd K12_OX/9-terminator
% dnadiff ../../reference.fasta asm.ctg.fasta
% head -n 20 out.report
                               [REF]                [QRY]
[Sequences]
TotalSeqs                          1                    1
AlignedSeqs               1(100.00%)           1(100.00%)
UnalignedSeqs               0(0.00%)             0(0.00%)

[Bases]
TotalBases                   4641652              4329920
AlignedBases         4557124(98.18%)      4303241(99.38%)
UnalignedBases          84528(1.82%)         26679(0.62%)

[Alignments]
1-to-1                           404                  404
TotalLength                  4582863              4324164
AvgLength                   11343.72             10703.38
AvgIdentity                    94.04                94.04

In this case, the E. coli genome is assembled into a single contig. The resulting assembly is approximately 94% identity to the reference and includes 98% of the reference genome. The corrected read identity/assembly accuracy can be improved by performing multiple passes of correction. The identity can also be improved to >99% (>99.5% with two rounds) by using nanopolish from Loman et. al. 2015. If you have Illumina data available, you can also use a tool like Pilon to improve assembly consensus.

% dnadiff reference.fasta polished.fa
% head -n 20 out.report
                               [REF]                [QRY]
[Sequences]
TotalSeqs                          1                    1
AlignedSeqs               1(100.00%)           1(100.00%)
UnalignedSeqs               0(0.00%)             0(0.00%)

[Bases]
TotalBases                   4641652              4617457
AlignedBases         4622650(99.59%)      4615685(99.96%)
UnalignedBases          19002(0.41%)          1772(0.04%)

[Alignments]
1-to-1                            71                   71
TotalLength                  4639588              4625432
AvgLength                   65346.31             65146.93
AvgIdentity                    99.24                99.24

Important Parameter Considerations

On a single system, no parameter adjustment should be necessary. PBcR will detect your available resources and appropriately adjust parameters to fill the memory/CPUs on your system. The default options have been tested on a variety of genomes and systems and should work well for the majority of users. The options below are for advanced users who would like to optimize their assembly.

Assembly of Corrected Sequences

By default, PBcR will assemble the dataset with default parameters. The automated parameters have been tested on a variety of haploid or inbred eukaryotic genomes but need to modified for larger or diploid genomes. For these cases, you can set the assembly parameters in the spec file provided to PBcR as:

asmOvlErrorRate=0.10
asmUtgErrorRate=0.10
asmCnsErrorRate=0.10
asmCgwErrorRate=0.10
asmOBT=1
asmObtErrorRate=0.08
asmObtErrorLimit=4.5
utgGraphErrorRate=0.05
utgMergeErrorRate=0.05

ovlHashBits=24
ovlHashLoad=0.80

For most users, it should be unnecessary to run assembly by hand. However, if you want to run the assembly by hand read on. You can download an example spec file here which is tuned for a 16-core 64GB machine. You must adjust the memory and thread settings manually to match your system. To avoid automatically assembling your corrected sequences, you can specify:

assemble=0

In all cases, the longest 25X subset of the corrected data is recommended for assembly. This is because Celera Assembler, like many OLC assemblers, is sensitive to high coverage and noise, especially near the ends of sequences. To remove some of this noise, we recommend using the longest 25X of sequence to assemble. Given an E. coli K12 genome (4.65Mbp genome size), 25X coverage is approximately 116MBp. The commands below will subset an frg file to only 25X of the longest sequences.

% <wgs>/<Linux-amd64>/bin/gatekeeper -T -F -o asm.gkpStore <your library name>.frg
% <wgs>/<Linux-amd64>/bin/gatekeeper -dumpfrg -longestlength 0 116000000 asm.gkpStore > <your library name>.25X.frg

After you have your longest corrected sequences, you can manually run an assembly:

% <wgs>/<Linux-amd64>/bin/runCA unitigger=bogart \
                          utgGraphErrorRate=0.05 utgGraphErrorLimit=3.25 \
                          utgMergeErrorRate=0.05 utgMergeErrorLimit=5.25batOptions=“-NS -RS -CS” obtErrorRate=0.08 obtErrorLimit=4.5 <your library name>.25X.frg

If you have less than 10X of PacBio sequences and other FRG files for other data types (Illumina, 454, etc), you can also add them to the runCA command above.

Alternate Assemblers

You do not have to use Celera Assembler to assemble the corrected sequencing data. Any OLC assembler will work well with the corrected data. Falcon 0.1.3 is one example. If you have assembled the E. coli, you can re-assemble it using Falcon instead of CA. After you have followed the Falcon instructions and have it installed run:

$ mkdir falcon_asm
$ cd falcon_asm
$ ln -s ../K12.longest25.fasta in.fasta
$ python falcon_overlap.py --min_len 3000 --n_core 16 --d_core 3 in.fasta > preads.ovlp
$ python falcon_asm.py preads.ovlp in.fasta
$ python falcon_fixasm.py

You will have an E. coli assembly in falcon_asm/primary_tigs_c.fa.

Running on a grid system

Currently, SGE, PBS/Torque, and LSF are supported and have been tested. It is possible the code will work on other grid systems but this is not tested/supported. There is external user support for SLURM. Each node in the grid must be a submit node. The default parameters assume SGE and default options for requesting memory/CPU cores. Since there is no standardized method for requesting memory/cores on SGE, you will likely need to to update these parameters. The minimum parameters that are needed to specify memory/threads are:

useGrid=1
scriptOnGrid=1
ovlCorrOnGrid=1
frgCorrOnGrid=1

ovlMemory = 32
ovlThreads = 16
ovlStoreMemory = 32000
threads = 16
cnsConcurrency = 8

gridOptionsCorrection=-pe threads 16 -l mem=10GB
gridOptionsOverlap=-pe threads 16 -l mem=2GB
gridOptionsMhap=-pe threads 16 -l mem=2GB
gridOptionsConsensus=-pe threads 8
gridOptionsScript=-pe threads 16 -l mem=10GB
gridOptionsFragmentCorrection=-pe threads 2 -l mem=4GB
gridOptionsOverlapCorrection=-pe threads 1 -l mem=4GB

The parallel environment (-pe) varies from system to system so you may need to change threads to whatever is appropriate on your system. The memory parameter may also be different on your system.

Several steps run as large array jobs (Mhap, Overlap, Consensus) so you want to set the memory/threads parameters for those to what is available on the most common node on your system (to maximize throughput). For other jobs which run as a single multi-core job (Correction, Script) you can set them to utilize a high-memory machine available on your cluster. Make sure that whatever you request in gridOptions for Mhap/Overlap matches what you set for ovlMemory and ovlThreads, otherwise jobs may overload/underload your systems or not run at all. If you want to restrict the number of jobs running on your cluster simultaneously you can add the -tc option to the array jobs (Mhap, Overlap, Consensus) specifying the number of jobs that may be concurrently scheduled. For example, if you set -tc 20 with the above config, the run would be limited to 320 slots for Mhap (16slots per job, 20 jobs).

If you have LSF, you also must specify

gridEngine=LSF

as well as updating the gridOptions to the flags to the equivalents on an LSF system or

gridEngine=PBS

for PBS.

Low Coverage Assembly

By default, self-correction is performed with a probabilistic overlapping algorithm, named MHAP. The default parameters have been set to be a good compromise between speed and sensitivity. If you have low-quality data (< 85% identity) or have shorter or older chemistries, it may be beneficial to increase sensitivity. The sensitive parameters are automatically enabled below 50X coverage. You can increase sensitivity on any dataset by specifying

-sensitive

as a parameters to PBcR. This will increase the overlap sensitivity to over 85% and turn on PBDAGCON (if it is available).

If you have <30X coverage, you can further adjust parameters. At this coverage, there is not sufficient depth to generate >99% accuracy corrected reads. Thus, the parameters below decrease the average identity of the corrected sequences and increase the assembly error rate. You can add the following:

# lower QV of corrected sequences from 99+ to 97-98
QV=52

# increase assembly error rate
asmOvlErrorRate=0.1
asmUtgErrorRate=0.06
asmCgwErrorRate=0.1
asmCnsErrorRate=0.1
asmOBT=1
asmObtErrorRate=0.08
asmObtErrorLimit=4.5
utgGraphErrorRate=0.05
utgMergeErrorRate=0.05

to your spec file if you would like to assemble <30X data. With these parameters, PBcR can close a bacterial genome (E. coli) with 20X P5 data and is comparable to hybrid tools at 20X on a eukaryotic genome (see above). These parameters for <30X assembly are experimental and have only been tried on E. coli, S. cerevisiae, and a 2.4Gbp eukaryote so your mileage may vary. PBDAGCON is also required for these coverage levels.

Controlling Consensus Module

PBcR supports two alternate consensus modules. The default is falcon_sense. An alternate is PBDAGCON, also utilized by HGAP. PBDAGCON must be installed on your system and available in your path. If you have SMRTportal installed and available in your path, you already have PBDAGCON. It can be beneficial below 60X coverage but in most cases, falcon_sense is sufficient for good assembly. PBDAGCON is automatically enabled as part of the sensitive mode above or can be enabled with

pbcns=1

Therefore, it may be beneficial to run falcon_sense to generate an initial assembly rapidly and then run PBDAGCON if the assembly is not satisfactory. To generate an assembly using both methods for the phage example above run

% <wgs>/<Linux-amd64>/bin/PBcR -length 500 -partitions 200 -l lambda -s pacbio.spec -fastq pacbio.filtered_subreads.fastq genomeSize=50000 -noclean

The -noclean option will prevent the pipeline from removing intermediate files. Once the assembly using falcon_sense is generated, you can see its stats:

% head -n 40 lambda/9-terminator/asm.qc
[Scaffolds]
TotalScaffolds=1
TotalContigsInScaffolds=1
MeanContigsPerScaffold=1.00
MinContigsPerScaffold=1
MaxContigsPerScaffold=1

TotalBasesInScaffolds=49698
MeanBasesInScaffolds=49698
MinBasesInScaffolds=49698
MaxBasesInScaffolds=49698
N25ScaffoldBases=49698
N50ScaffoldBases=49698
N75ScaffoldBases=49698

TotalSpanOfScaffolds=49698
MeanSpanOfScaffolds=49698
MinScaffoldSpan=49698
MaxScaffoldSpan=49698
IntraScaffoldGaps=0
2KbScaffolds=1
2KbScaffoldSpan=49698
MeanSequenceGapLength=0

[Top5Scaffolds=contigs,size,span,avgContig,avgGap,EUID]
0=1,49698,49698,49698,0,7180000000002
total=1,49698,49698,49698,0

[Contigs]
TotalContigsInScaffolds=1
TotalBasesInScaffolds=49698
TotalVarRecords=0
MeanContigLength=49698
MinContigLength=49698
MaxContigLength=49698
N25ContigBases=49698
N50ContigBases=49698
N75ContigBases=49698

[BigContigs_greater_10000]

In this case, the genome is assembled into one contigs. To try the assembly using PBDAGCON, move the old assembly

% mkdir first_asm
% mv lambda* first_asm

Next, clean up the existing results

% rm templambda/[0-9]*.success
% rm templambda/runPartition.sh

and finally re-launch the pipeline

% <wgs>/<Linux-amd64>/bin/PBcR -length 500 -partitions 200 -l lambda -s pacbio.spec -fastq pacbio.filtered_subreads.fastq genomeSize=50000 pbcns=1

PBcR will skip all steps before consensus and generate a new assembly.

% head -n 40 lambda/9-terminator/asm.qc
-bash-3.2$ head -n 40 lambda/9-terminator/asm.qc 
[Scaffolds]
TotalScaffolds=1
TotalContigsInScaffolds=1
MeanContigsPerScaffold=1.00
MinContigsPerScaffold=1
MaxContigsPerScaffold=1

TotalBasesInScaffolds=49508
MeanBasesInScaffolds=49508
MinBasesInScaffolds=49508
MaxBasesInScaffolds=49508
N25ScaffoldBases=49508
N50ScaffoldBases=49508
N75ScaffoldBases=49508

TotalSpanOfScaffolds=49508
MeanSpanOfScaffolds=49508
MinScaffoldSpan=49508
MaxScaffoldSpan=49508
IntraScaffoldGaps=0
2KbScaffolds=1
2KbScaffoldSpan=49508
MeanSequenceGapLength=0

[Top5Scaffolds=contigs,size,span,avgContig,avgGap,EUID]
0=1,49508,49508,49508,0,7180000000002
total=1,49508,49508,49508,0

[Contigs]
TotalContigsInScaffolds=1
TotalBasesInScaffolds=49508
TotalVarRecords=0
MeanContigLength=49508
MinContigLength=49508
MaxContigLength=49508
N25ContigBases=49508
N50ContigBases=49508
N75ContigBases=49508

[BigContigs_greater_10000]

The phage is successfully assembled into a single contig by both methods.

Known Issues

Celera Assembler Known Issues

For the latest issues found in the Celera Assembler release, please see known issues page. You can also check the Celera Assembler bugs page for any reported issues.

Correcting Large (> 100Mbp) Genomes

PBcR supports a novel overlapping module for efficiently identifying all-pairs overlaps in high-error long-read data. CA 8.3 is recommended for large genomes sequenced with PacBio data and using self-correction. If you are using hybrid correction (correcting PacBio with other data), then follow the instructions below for CA 8.1. MHAP has been tested on S. cerevisiae, A. thaliana, and D. melanogaster. To use the pipeline 50X of PacBio RS sequencing data is recommended. If you have less coverage than this, you can still use PBcR and increase the sensitivity at the expense of runtime.

PBcR will auto-detect available memory and cores if you are running on a single system. You can limit these the same way as the example spec file from the E. coli by specifying

ovlMemory = <GB per job>

More memory will speed up computation and create fewer partitions, less memory will create more partitions. When running on a grid, you should specify the memory available on each node being utilized (see above). A single machine is sufficient for medium-size eukaryotes (<300Mbp) and PBcR should complete in a few days, however, a cluster is recommended for mammalian-sized genomes. A local scratch disk is recommended for best performance on the grid.

Additionally, CA 8.2 is available on AWS with instruction and large genome examples

Correcting Large (> 100Mbp) Genomes (Using high-identity data or CA 8.1)

PBcR has support for using BLASR when correcting large genomes starting in CA 8.1. If you do not have BLASR/SMRTportal installed, follow the instructions below for CA 7.0.

maxGap=1500
ovlHashBlockLength=1000000000
ovlRefBlockLength=1000000000

If you are using high-identity sequences for correction, add the parameter

blasr=-noRefineAlign -advanceHalf -noSplitSubreads -minMatch 10 -minPctIdentity 70 -bestn 24 -nCandidates 24

For self correction, add the parameter

blasr=-minReadLength 200 -maxScore -1000 -maxLCPLength 16 -bestn 24 -nCandidates 24

instead.

When using BLASR, each overlap job should take less than 8GB of memory. Thus, the limitation is usually the number of cores on the system, not the memory. You can set ovlThreads=16 and ovlConcurrency to be:

<total cores> / <ovlThreads>

Correcting Large (> 100Mbp) Genomes With CA 7.0 or older (not recommended)

ovlHashBits = 25
ovlHashBlockLength = 1000000000
ovlRefBlockSize =  100000000
doOverlapBasedTrimming = 0

This will make each overlapper job use approximately 18GB of RAM (up from the default of 6GB of ram). You should set the ovlConcurrency to be:

<total memory> / 18

For example, for a 256GB machine, ovlConcurrency would be 14. The ovlThreads parameter should be set to:

<total cores> / ovlConcurrency

So for a 256GB machine with 32 cores, ovlThreads would be 2 (assuming ovlConcurrency was set to 14 above). When using an SGE grid, the sgeOverlap parameter should also be set to request the appropriate threads and memory:

sgeOverlap = -pe threads 2 -l mem=10GB

For full details on tuning these parameters, please see the Celera Assembler runCA wiki page.

If you already have a correction running, you can tell the number of overlap jobs created by running

% wc -l temp<your library name>/0-overlaptrim-overlapper/ovlopt

If the number of overlap jobs is large (over 400), then the parameters above are recommended over the default settings.

No Overlaps Found

The PacBio SMRT-portal version 1.3.0 outputs a non-standard fastq file from their filtering pipeline. This causes the PBcR pipeline to either stop with an error or output no corrected sequences. You can check if your fastq file is the cause of the problem if it splits the sequence and quality on multiple lines:

@HWUSI-EAS109E:34:64G34AAXX:7:2:15068:1426 1:N:0:GAGTGG
TCANCCCTGCATTATAAGAAGTACCGCAGGCGACG
ATCTGTACATGTTCCACTTTTGCCAGAATCTCTGCC
+
EEE#EGGGGGIGIDIGHIIEIHIIIIIGDIIIGGFHHIIIGIIIIFID
HHIGHIHHGHIHIGHGIGHGGIC
@HWUSI-EAS109E:34:64G34AAXX:7:2:4347:2391 1:N:0:GAGTGG
TAANGGGCTGACAACGCATTACACAAAACTCAAAC
ACAACAACCGTAACCACCGCGGTTCAATGGGACTG
G
+
EEE#EGGGGGIIIIIIHIIIIIHHIIIHIIIIIIIIDIIIGGGGGDGGG
GIEIIHIDHIIHIFHEGGHIF=

versus the expected format of:

@HWUSI-EAS109E:34:64G34AAXX:7:2:15068:1426 1:N:0:GAGTGG
TCANCCCTGCATTATAAGAAGTACCGCAGGCGACGATCTGTACATGTTCCACTTTTGCCAGAATCTCTGCC
+
EEE#EGGGGGIGIDIGHIIEIHIIIIIGDIIIGGFHHIIIGIIIIFIDHHIGHIHHGHIHIGHGIGHGGIC
@HWUSI-EAS109E:34:64G34AAXX:7:2:4347:2391 1:N:0:GAGTGG
TAANGGGCTGACAACGCATTACACAAAACTCAAACACAACAACCGTAACCACCGCGGTTCAATGGGACTGG
+
EEE#EGGGGGIIIIIIHIIIIIHHIIIHIIIIIIIIDIIIGGGGGDGGGGIEIIHIDHIIHIFHEGGHIF=

As a workaround, you can use the fasta file and convert it to fastq following the instructions above. Alternatively, you can use a perl script to convert non-standard fastq files to ones PBcR expects. Thank you to Lee Katz for providing the script. To convert a non-standard fastq file named nonstandard.fastq to the correct format run the command:

% perl run_assembly_convertMultiFastqToStandard.pl -i nonstandard.fastq -o standard.fastq

Run the script with no parameters to get help.

Cite

Berlin K, Koren S, Chin CS, Drake PJ, Landolin JM, Phillippy AM Assembling Large Genomes with Single-Molecule Sequencing and Locality Sensitive Hashing. Nature Biotechnology. (2015).

Koren S, Harhay GP, Smith TPL, Bono JL, Harhay DM, Mcvey SD, Radune D, Bergman NH, Phillippy AM. Reducing assembly complexity of microbial genomes with single-molecule sequencing. Genome Biology 14:R101 (2013)

Koren S, Schatz, M. C., Walenz, B. P., Martin, J., Howard, J. T., Ganapathy, G., Wang, Z., Rasko, D. A., McCombie, W. R., Jarvis, E. D., and Phillippy, A. M. Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nature Biotech. (2012)

Contact

If you encounter issues or have questions, please contact the authors of the pipeline, Sergey Koren (sergek AT umd.edu) or Adam M. Phillippy (aphillippy AT gmail.com)