If you are assembling PacBio or Oxford Nanopore data, please use Canu instead. Celera Assembler is no longer being maintained.
Assembling Large Genomes with Single-Molecule Sequencing and Locality Sensitive Hashing. Nature Biotechnology. (2015).
Konstantin Berlin*, Sergey Koren*, Jason Chin, James P. Drake Jane M. Landolin Adam M. Phillippy,
- The ‘MHAP’ paper demonstrates that probabilistic alignment methods are well suited to address the read length and error rate of single-molecule and can significantly improve on the runtimes of existing approaches. Please cite for projects using non-hybrid CA v8.2 on.
Reducing assembly complexity of microbial genomes with single-molecule sequencing. Genome Biology 14:R101 (2013).
- The ‘assembly complexity’ paper demonstrates that a majority of microbial genomes can be fully resolved to finished-grade accuracy using PacBio SMRT sequencing based on an analysis of all genomes in Genbank as of 2012. An updated version of PBcR for non-hybrid hierarchical assembly is also presented. Please cite for projects using non-hybrid PBcR with CA prior to v8.2.
Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nature Biotech. (2012).
- The ‘PBcR’ paper presents the first hierarchical assembly method (overlap, correct, assemble) and demonstrates that long, noisy reads can be used for assembly after correction. The hierarchical approach is now used by many single-molecule assembly tools, including HGAP, Falcon, and others. Please cite for projects utilizing hybrid PBcR correction or assembling long-read (>2Kbp) reads with Celera Assembler.
About PacBio RS
The PacBio RS sequencing instrument from Pacific Biosciences is a single-molecule sequencer that is able to generate long-reads (CLR). Unfortunately, these sequences have a low accuracy, averaging only 85-88%. While It is possible to read each circularized molecule multiple times to get higher accuracy sequences (CCS), this produces significantly shorter sequences.
About Oxford Nanopore MinION
The Oxford Nanopore MinION sequencing instrument from Oxford Nanopore Technologies is a single-molecule sequencer. As it is based on nanopore sensing and requires no optics, it is a miniaturized device, approximately the size of a thumb-drive. The most recent chemistry (7.3) and basecaller (Metrichor >1.9) can produce 2D (sequence read in both the template and reverse strand) data with 78-85% accuracy (Loman et. al. 2015). At this error rate, PBcR can work out of the box with nanopore data. .
The PBcR pipeline enables the use of the high-noise long-read sequences produced by a single-molecule instrument. It was the first hybrid assembly pipeline. The hybrid approach works by first improving the accuracy of the single-molecule sequences and then assembling the improved sequences. To algorithmically deal with the error, we support solely high-noise data from recent chemistries (C2 for PacBio or 7.3 for Oxford) or high-identity sequences (454, Illumina, or PacBio circularized sequences) to re-call accurate consensus.
The pipeline outputs fastq, fasta, and qual formats as well as a Celera Assembler FRG file and will automatically assemble the data.
If you are having trouble, please check the known issues and known Celera Assembler issues to see if your problem is documented. If you are working with large genomes (>100Mbp), check the recommended parameters section for instructions.
- May 24, 2015 - PBcR and CA 8.3rc2 as source or pre-compiled for Linux is now available. CA 8.3rc2 includes support for Oxford Nanopore parameters as well as the PBS/Torque cluster system. Computing overlaps for 70X of P6 human data now takes < 8K cpu hours on an NFS file system. Performance on low-coverage (<30X) parameters is improved with adjusted parameters, requiring only 20X P5 data to close an E. coli genome. See the release notes for details.
- March 25, 2015 - Added experimental suggested parameters for very low-coverage assembly (30X or less).
- February 26, 2015 - PBcR and CA example assembly of E. coli K12 using Oxford Nanopore MinION data from Loman et. al. 2015. This data was originally assembled using a novel correction tool, nanocorrect, and CA.
- February 23, 2015 - PBcR and CA 8.3rc1 as source or pre-compiled for Linux is now available. CA 8.3rc1 includes an updated version of MHAP which is approximately 4-fold faster. Computing overlaps for 54X of human data now takes < 20K cpu hours on an NFS file system instead of 80K. Performance on low-coverage datasets is also significantly improved, requiring only 30X P5 data to close an E. coli genome. See the release notes for an example on S. cerevisae.
- July 19, 2014 - PBcR and CA 8.2 as source or pre-compiled for Linux is now available. PBcR now incorporates a novel probabilistic overlapper for self-correction of sequences named MHAP. This allows assembly of prokaryotic genomes in < 30 minutes on a typical desktop and assembly of small eukaryotic genomes in < 2 days. For best results, java 1.8r25 or newer is recommended to use MHAP.
- January 13, 2014 - PBcR followed by assembly using CA 8.1 is used to assemble D. melanogaster using only PacBio RS II sequencing reads. The CA assembly places entire chromosome arms de novo into fewer pieces than the current version of the reference genome (Release 5). The largest contig comprises the majority of chromosome arm 3L. All autosomal arms (2L, 2R, 3R, and 4) are in less than six large contigs. More details, raw and corrected data, as well as the assembly are available at the PacBio Blog and CBCB.
- December 17, 2013 - CA 8.1 as source or pre-compiled for Linux is available. CA 8.1 supports self-correction of PacBio as well as correction of PacBio sequences using complementary high-identity sequencing (such as Illumina, 454, or CCS).
- September 13, 2013 - Final version of Reducing assembly complexity of microbial genomes with single-molecule sequencing is now published in Genome Biology. Supporting materials and data are also posted at CBCB
- April 15, 2013 - Pre-print of Reducing assembly complexity of microbial genomes with single-molecule sequencing posted on arXiv. Supporting materials and data are also posted at CBCB
- April 4, 2013 - Updated documentation to describe recommended Celera Assembler parameters and Quiver polishing of consensus for assembly of corrected sequences. Please see assembly of corrected sequences for more information.
- March 19, 2013 - Updated documentation to describe new option of self-correction. PBcR can correct PacBio RS C2 data (or newer) using solely long-read low-accuracy data. Please see self-correction for more information.
- March 18, 2013 - Updated documentation to describe new support for both BLASR or Bowtie 2 for alignment (in addition to CA's overlapper).
- August 9, 2012 - An NGS Leaders webinar by Adam Phillippy and Sergey Koren is now available at YouTube.
- July 1, 2012 - The paper on correction and assembly is now available as an advance online publication from Nature Biotechnology.
- June 20, 2012 - New known issue: using Illumina sequences shorter than 100bp for correction requires adjusted parameters. Please see the known issues for the required parameters.
- May 4, 2012 - New known issue: PacBio SMRT-portal 1.3.0 outputs fastq files in a non-standard format. If you encounter errors running the pipeline or have no corrected sequences output, please see the known issues.
- Feb. 15, 2012 - Presentation on error correction and assembly at AGBT
The PBcR pipeline is available in CA 7.0 or newer, however CA 8.3 is recommended. The self-correction feature, requires at least CA 8.2 but Celera Assembler 8.3 is recommended. The instructions below assume you are using the CA 8.3 release. If you are using an older version of CA (8.1 or earlier), please see the legacy instructions.
usage:PBcR [options] -s spec.file fastqFile=fastqfile [frg] -length Minimum length of PacBio RS fragment to keep. -partitions Number of partitions for consensus. -l libraryname Name of the library; freeformat text. -t threads Number of threads to use for correction. -noclean Do not clean up after the correction finishes. By default, the temp<libraryname> directory will be removed on successful completion of the correction.
The pipeline will corrected the sequences and output a CA FRG file consisting of a single LIB message to <libraryname>.frg. It will also output the corrected fasta, qual, and fastq sequences (named <libraryname>.fasta, <libraryname>.qual, and <libraryname>.fastq). Finally, it will assemble the data with default parameters.
In addition, the pipeline will also output <libraryname>.log file which indicates the source for each corrected sequence. The format of the file is:
% head <libraryname>.log INPUT_NAME OUTPUT_NAME SUBREAD START END LENGTH m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/8962/3861_17595 pacbio_3307_1 1 53 13682 13629 m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/8965/6675_14278 pacbio_3309_1 1 51 7551 7500 m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/8977/0_17507 pacbio_3316_1 1 63 17456 17393
The first column is the input name of the PacBio RS sequencing read which was corrected. The second column is the name of the output corrected sequence. The third column indicates if the input PacBio RS sequence was split into more than a single fragment. This can happen due to SMRTbell adapters or issues correcting a subset of the long sequence. The next columns indicate the positions in the original PacBio RS sequence where the corrected read came from. The final column is the length of the corrected sequence. However, the log file and positions are estimates because they are a pre-consensus view. The consensus module may trim/split/discard sequences if there is insufficient coverage to achieve a minimum QV for the corrected sequence.
The <libraryname>.correction.* and <libraryname>.err files may also get created based on whether the pipeline runs on the grid or a single machine. These provide debugging output. If you encounter issues with the pipeline or have questions, please include these *.err files with your email.
The -libraryname option is mandatory. It provides a UID name for the library that the reads are placed into. This name is completely up to the end user, but must contain no white spaces or commas.
The -s pacbio.spec option is mandatory. It should point to a a CA spec file to specify how to overlap the illumina and PacBio data for correction (which comes included with the sample data.) On a shared-memory multi-core system, the spec file can be empty and the pipeline will auto-detect the available resources. On a grid environment (SGE/LFS/PBS) see more information in the grid configuration section.
The fastqFile option is mandatory. It specifies the pacbio fragments to correct. The correction will also work with unfiltered reads if you don't have a filtered fastq file.
The options (-length, -partitions, and -t, -noclean) are optional.
You can specify an arbitrary number of FRG files to be used as the trusted sequences for correction. If you have 20X+ coverage of C2 sequencing data (or newer) you can also perform self-correction. This option is automatically enabled if you omit the FRG files. A total of 50+X coverage is recommended but good assemblies can be generated from as low as 15X. Below is an example on S. cerevisiae comparing CA 8.2 and CA 8.3 performance at varying coverages. If you have less than 30X, see the recommended parameters below.
Below 15X, a hybrid approach is likely to outperform self-correction. Input FRG files can be any format CA accepts. Please see the FastqToCA and SffToCA pages for information on inputting Illumina or 454 data. Generally, you don't need more than about 50X of high-identity sequences for correction. Note that this mode of PBcR is no longer being updated and is significantly slower than self-correction due to its reliance on BLASR (or a built-in module) for overlap computation.
Inputting Single-Molecule Sequences
The PBcR pipeline expects single-molecule sequences in fastq format (with sanger (PHRED32) quality values). If you already have fasta files, you can download a Java utility to convert the fasta files into PBcR compatible fastq files:
java -jar convertFastaAndQualToFastq.jar myfile.fasta > myfile.fastq
NOTE: If you are using the SMRT-portal version 1.3, please see the known issues if you encounter errors or no corrected sequences are output.
Self-Correction With C2/7.3 Sequences (or newer)
Starting with the C2 chemistry (PacBio, and the more recent chemistries), or 7.3 2D sequences (MinION), you can self-correct PacBio RS sequences without any high-identity technology. Celera Assembler 8.3 is recommended. Approximately 50X (or more) of sequencing is ideal. If you have less than 15X PacBio RS sequencing, then you can use hybrid correction with high-identity data. While PBcR supports high-identity data for correction, this mode is no longer being updated, is not supported for MinION data, and is significantly slower than self-correction due to its reliance on BLASR (or a built-in module) for overlap computation.
To run the pipeline, follow the usage instructions below.
Correction Using Complementary High-Identity Sequences
If you have less than 15X sequencing or older chemistry, then it is recommended to combine the PacBio data with a complementary high-identity sequencing dataset. These can be Illumina, 454, Ion Torrent, or PacBio CCS reads. Note that this mode of PBcR is no longer being updated and is significantly slower than self-correction due to its reliance on BLASR (or a built-in module) for overlap computation.
Inputting High-Identity Sequences
The PBcR pipeline accepts an arbitrary number of frg files at the end of the command.
- For Illumina data, use the fastqToCA command. For examples, see the phage example below and the usage page.
- For 454 data, use sffToCA.
- For PacBio RS CCS sequences, use the fastqToCA command. For examples, see the usage page.
Assembling a Lambda Phage
This tutorial covers both self and hybrid correction. Download the sample data.
% tar xvzf sampleData.tar.gz sampleData/ sampleData/illumina.fastq sampleData/reference.fasta sampleData/pacbio.filtered_subreads.fasta sampleData/pacbio.spec sampleData/pacbio.SGE.spec sampleData/README sampleData/convertFastaAndQualToFastq.jar
Prepare PacBio RS single-pass sequences
Next, run the correction to generate corrected pacbio sequences. This example uses the Java utility above to convert the PacBio RS fasta file to fastq.
% java -jar convertFastaAndQualToFastq.jar pacbio.filtered_subreads.fasta > pacbio.filtered_subreads.fastq
Correcting and Assembling PacBio RS sequences
Run the correction to self-correct and assemble the PacBio RS sequences.
% <wgs>/<Linux-amd64>/bin/PBcR PBcR -length 500 -partitions 200 -l lambda -s pacbio.spec -fastq pacbio.filtered_subreads.fastq genomeSize=50000
After the correction runs, you will have an assembly in lambda/9-terminator/asm.ctg.fasta. You can check the statistics.
head -n 40 lambda/9-terminator/asm.qc [Scaffolds] TotalScaffolds=1 TotalContigsInScaffolds=1 MeanContigsPerScaffold=1.00 MinContigsPerScaffold=1 MaxContigsPerScaffold=1 TotalBasesInScaffolds=49698 MeanBasesInScaffolds=49698 MinBasesInScaffolds=49698 MaxBasesInScaffolds=49698 N25ScaffoldBases=49698 N50ScaffoldBases=49698 N75ScaffoldBases=49698 TotalSpanOfScaffolds=49698 MeanSpanOfScaffolds=49698 MinScaffoldSpan=49698 MaxScaffoldSpan=49698 IntraScaffoldGaps=0 2KbScaffolds=1 2KbScaffoldSpan=49698 MeanSequenceGapLength=0 [Top5Scaffolds=contigs,size,span,avgContig,avgGap,EUID] 0=1,49698,49698,49698,0,7180000000002 total=1,49698,49698,49698,0 [Contigs] TotalContigsInScaffolds=1 TotalBasesInScaffolds=49698 TotalVarRecords=0 MeanContigLength=49698 MinContigLength=49698 MaxContigLength=49698 N25ContigBases=49698 N50ContigBases=49698 N75ContigBases=49698 [BigContigs_greater_10000]
In this case, the phage was assembled into a single contig.
Correction using complementary data
First, convert illumina.fastq to CA frg format:
% cd sampleData/ % <wgs>/<Linux-amd64>/bin/fastqToCA -libraryname illumina -technology illumina -type sanger -innie -reads illumina.fastq > illumina.frg % <wgs>/<Linux-amd64>/bin/PBcR -length 500 -partitions 200 -l lambdaIll -s pacbio.spec -fastq pacbio.filtered_subreads.fastq genomeSize=50000 illumina.frg
Note that the illumina.frg file was added the command line. The pipeline will correct and assemble the PacBio RS sequences. When it is finished, you will have an assembled lambda in lambdaIll/9-terminator/asm.ctg.fasta. You can check the assembly statistics
head -n 40 lambdaIll/9-terminator/asm.qc [Scaffolds] TotalScaffolds=1 TotalContigsInScaffolds=1 MeanContigsPerScaffold=1.00 MinContigsPerScaffold=1 MaxContigsPerScaffold=1 TotalBasesInScaffolds=48112 MeanBasesInScaffolds=48112 MinBasesInScaffolds=48112 MaxBasesInScaffolds=48112 N25ScaffoldBases=48112 N50ScaffoldBases=48112 N75ScaffoldBases=48112 TotalSpanOfScaffolds=48112 MeanSpanOfScaffolds=48112 MinScaffoldSpan=48112 MaxScaffoldSpan=48112 IntraScaffoldGaps=0 2KbScaffolds=1 2KbScaffoldSpan=48112 MeanSequenceGapLength=0 [Top5Scaffolds=contigs,size,span,avgContig,avgGap,EUID] 0=1,48112,48112,48112,0,7180000000002 total=1,48112,48112,48112,0 [Contigs] TotalContigsInScaffolds=1 TotalBasesInScaffolds=48112 TotalVarRecords=0 MeanContigLength=48112 MinContigLength=48112 MaxContigLength=48112 N25ContigBases=48112 N50ContigBases=48112 N75ContigBases=48112 [BigContigs_greater_10000]
Using either self or hybrid correction, the phage can be accurately assembled into a single contig.
Assembling an E. coli
This example demonstrates the pipeline on a bacterial genome. Download the a random 30X subset of P5-C3 chemistry sequences (after filtering with SMRTportal). The full SMRTcell is also available. At this coverage, PBcR will automatically enable it's sensitive mode which uses adjusted MHAP parameters and PBDAGCON (which is included in the Linux/source distributions).
% tar xvzf selfSampleData.tar.gz selfSampleData/reference.fasta selfSampleData/pacbio.spec selfSampleData/pacbio_filtered.fastq selfSampleData/README
Run the pipeline to generate an assembly
% <wgs>/<Linux-amd64>/bin/PBcR -length 500 -partitions 200 -l K12 -s pacbio.spec -fastq pacbio_filtered.fastq genomeSize=4650000
The pipeline will automatically correct and assemble the data. The correction and assembly should take <30 minutes on a 16-core system. The corrected sequences are in K12.fastq (the longest 25X subset is K12.longest25.fastq). We can see the assembled ecoli in in K12/9-terminator/asm.ctg.fasta and check the stats for the assembly:
head -n 40 K12/9-terminator/asm.qc [Scaffolds] TotalScaffolds=1 TotalContigsInScaffolds=1 MeanContigsPerScaffold=1.00 MinContigsPerScaffold=1 MaxContigsPerScaffold=1 TotalBasesInScaffolds=4635585 MeanBasesInScaffolds=4635585 MinBasesInScaffolds=4635585 MaxBasesInScaffolds=4635585 N25ScaffoldBases=4635585 N50ScaffoldBases=4635585 N75ScaffoldBases=4635585 ScaffoldAt1000000=4635585 TotalSpanOfScaffolds=4635585 MeanSpanOfScaffolds=4635585 MinScaffoldSpan=4635585 MaxScaffoldSpan=4635585 IntraScaffoldGaps=0 2KbScaffolds=1 2KbScaffoldSpan=4635585 MeanSequenceGapLength=0 [Top5Scaffolds=contigs,size,span,avgContig,avgGap,EUID] 0=1,4635585,4635585,4635585,0,7180000000002 total=1,4635585,4635585,4635585,0 [Contigs] TotalContigsInScaffolds=1 TotalBasesInScaffolds=4635585 TotalVarRecords=0 MeanContigLength=4635585 MinContigLength=4635585 MaxContigLength=4635585 N25ContigBases=4635585 N50ContigBases=4635585 N75ContigBases=4635585 ContigAt1000000=4635585
If you have MUMmer available, you can run:
% dnadiff reference.fasta K12/9-terminator/asm.ctg.fasta % head -n 20 out.report [REF] [QRY] [Sequences] TotalSeqs 1 1 AlignedSeqs 1(100.00%) 1(100.00%) UnalignedSeqs 0(0.00%) 0(0.00%) [Bases] TotalBases 4639560 4635585 AlignedBases 4633780(99.88%) 4635509(100.00%) UnalignedBases 5780(0.12%) 72(0.00%) [Alignments] 1-to-1 9 9 TotalLength 4634882 4633701 AvgLength 514986.89 514855.67 AvgIdentity 99.96 99.96
After assembly, polishing with Quiver is recommended to get the final consensus sequence. The full input data (bax.h5 files) are required to perform this polishing. If you have downloaded it and have SMRTportal installed and in your path, you can run:
find . |grep bax.h5 > K12.fofn pbalign --nproc 16 --forQuiver K12.fofn K12/9-terminator/asm.ctg.fasta K12.cmp.h5 quiver -j 16 K12.cmp.h5 -r K12/9-terminator/asm.ctg.fasta -o K12/9-terminator/asm.quiver.fasta -o K12/9-terminator/asm.quiver.gff
Assembling a MinION dataset
This example demonstrates the pipeline on recently released Oxford Nanopore MinION dataset from Loman, NJ, Quick J, and Simpson JT. A complete bacterial genome assembled de novo using only nanopore sequencing data. bioRxiv, 2015 Download the dataset of 30x of 7.3 chemistry 2D reads after conversion with poretools. The original data is also available (ERX708228-ERX708231).
% tar xvzf sampleDataOxford.tar.gz sampleDataOxford/reference.fasta sampleDataOxford/README sampleDataOxford/oxford.spec sampleDataOxford/all.2d.fastq
Run the pipeline to generate an assembly
% <wgs>/<Linux-amd64>/bin/PBcR -length 500 -partitions 200 -l K12_OX -s oxford.spec -fastq all.2d.fastq genomeSize=4650000
The correction should take <5 minutes on a 16-core system. The assembly will take approximately 1.5hrs as it uses a higher overlapping error rate than the default for PacBio RS corrected sequences. In total, less than 16CPU hours are required for correction and assembly (<1 for MHAP). Afterwards, we can see the assembled ecoli K12_OX/9-terminator/asm.ctg.fasta. We can check the stats for the assembly:
% head -n 40 K12_OX/9-terminator/asm.qc [Scaffolds] TotalScaffolds=1 TotalContigsInScaffolds=1 MeanContigsPerScaffold=1.00 MinContigsPerScaffold=1 MaxContigsPerScaffold=1 TotalBasesInScaffolds=4329931 MeanBasesInScaffolds=4329931 MinBasesInScaffolds=4329931 MaxBasesInScaffolds=4329931 N25ScaffoldBases=4329931 N50ScaffoldBases=4329931 N75ScaffoldBases=4329931 ScaffoldAt1000000=4329931 TotalSpanOfScaffolds=4329931 MeanSpanOfScaffolds=4329931 MinScaffoldSpan=4329931 MaxScaffoldSpan=4329931 IntraScaffoldGaps=0 2KbScaffolds=1 2KbScaffoldSpan=4329931 MeanSequenceGapLength=0 [Top5Scaffolds=contigs,size,span,avgContig,avgGap,EUID] 0=1,4329931,4329931,4329931,0,7180000000002 total=1,4329931,4329931,4329931,0 [Contigs] TotalContigsInScaffolds=1 TotalBasesInScaffolds=4329931 TotalVarRecords=0 MeanContigLength=4329931 MinContigLength=4329931 MaxContigLength=4329931 N25ContigBases=4329931 N50ContigBases=4329931 N75ContigBases=4329931 ContigAt1000000=4329931
If you have MUMmer available you can run:
% cd K12_OX/9-terminator % dnadiff ../../reference.fasta asm.ctg.fasta % head -n 20 out.report [REF] [QRY] [Sequences] TotalSeqs 1 1 AlignedSeqs 1(100.00%) 1(100.00%) UnalignedSeqs 0(0.00%) 0(0.00%) [Bases] TotalBases 4641652 4329920 AlignedBases 4557124(98.18%) 4303241(99.38%) UnalignedBases 84528(1.82%) 26679(0.62%) [Alignments] 1-to-1 404 404 TotalLength 4582863 4324164 AvgLength 11343.72 10703.38 AvgIdentity 94.04 94.04
In this case, the E. coli genome is assembled into a single contig. The resulting assembly is approximately 94% identity to the reference and includes 98% of the reference genome. The corrected read identity/assembly accuracy can be improved by performing multiple passes of correction. The identity can also be improved to >99% (>99.5% with two rounds) by using nanopolish from Loman et. al. 2015. If you have Illumina data available, you can also use a tool like Pilon to improve assembly consensus.
% dnadiff reference.fasta polished.fa % head -n 20 out.report [REF] [QRY] [Sequences] TotalSeqs 1 1 AlignedSeqs 1(100.00%) 1(100.00%) UnalignedSeqs 0(0.00%) 0(0.00%) [Bases] TotalBases 4641652 4617457 AlignedBases 4622650(99.59%) 4615685(99.96%) UnalignedBases 19002(0.41%) 1772(0.04%) [Alignments] 1-to-1 71 71 TotalLength 4639588 4625432 AvgLength 65346.31 65146.93 AvgIdentity 99.24 99.24
Important Parameter Considerations
On a single system, no parameter adjustment should be necessary. PBcR will detect your available resources and appropriately adjust parameters to fill the memory/CPUs on your system. The default options have been tested on a variety of genomes and systems and should work well for the majority of users. The options below are for advanced users who would like to optimize their assembly.
Assembly of Corrected Sequences
By default, PBcR will assemble the dataset with default parameters. The automated parameters have been tested on a variety of haploid or inbred eukaryotic genomes but need to modified for larger or diploid genomes. For these cases, you can set the assembly parameters in the spec file provided to PBcR as:
asmOvlErrorRate=0.10 asmUtgErrorRate=0.10 asmCnsErrorRate=0.10 asmCgwErrorRate=0.10 asmOBT=1 asmObtErrorRate=0.08 asmObtErrorLimit=4.5 utgGraphErrorRate=0.05 utgMergeErrorRate=0.05 ovlHashBits=24 ovlHashLoad=0.80
For most users, it should be unnecessary to run assembly by hand. However, if you want to run the assembly by hand read on. You can download an example spec file here which is tuned for a 16-core 64GB machine. You must adjust the memory and thread settings manually to match your system. To avoid automatically assembling your corrected sequences, you can specify:
In all cases, the longest 25X subset of the corrected data is recommended for assembly. This is because Celera Assembler, like many OLC assemblers, is sensitive to high coverage and noise, especially near the ends of sequences. To remove some of this noise, we recommend using the longest 25X of sequence to assemble. Given an E. coli K12 genome (4.65Mbp genome size), 25X coverage is approximately 116MBp. The commands below will subset an frg file to only 25X of the longest sequences.
% <wgs>/<Linux-amd64>/bin/gatekeeper -T -F -o asm.gkpStore <your library name>.frg % <wgs>/<Linux-amd64>/bin/gatekeeper -dumpfrg -longestlength 0 116000000 asm.gkpStore > <your library name>.25X.frg
After you have your longest corrected sequences, you can manually run an assembly:
% <wgs>/<Linux-amd64>/bin/runCA unitigger=bogart \ utgGraphErrorRate=0.05 utgGraphErrorLimit=3.25 \ utgMergeErrorRate=0.05 utgMergeErrorLimit=5.25batOptions=“-NS -RS -CS” obtErrorRate=0.08 obtErrorLimit=4.5 <your library name>.25X.frg
If you have less than 10X of PacBio sequences and other FRG files for other data types (Illumina, 454, etc), you can also add them to the runCA command above.
You do not have to use Celera Assembler to assemble the corrected sequencing data. Any OLC assembler will work well with the corrected data. Falcon 0.1.3 is one example. If you have assembled the E. coli, you can re-assemble it using Falcon instead of CA. After you have followed the Falcon instructions and have it installed run:
$ mkdir falcon_asm $ cd falcon_asm $ ln -s ../K12.longest25.fasta in.fasta $ python falcon_overlap.py --min_len 3000 --n_core 16 --d_core 3 in.fasta > preads.ovlp $ python falcon_asm.py preads.ovlp in.fasta $ python falcon_fixasm.py
You will have an E. coli assembly in falcon_asm/primary_tigs_c.fa.
Running on a grid system
Currently, SGE, PBS/Torque, and LSF are supported and have been tested. It is possible the code will work on other grid systems but this is not tested/supported. There is external user support for SLURM. Each node in the grid must be a submit node. The default parameters assume SGE and default options for requesting memory/CPU cores. Since there is no standardized method for requesting memory/cores on SGE, you will likely need to to update these parameters. The minimum parameters that are needed to specify memory/threads are:
useGrid=1 scriptOnGrid=1 ovlCorrOnGrid=1 frgCorrOnGrid=1 ovlMemory = 32 ovlThreads = 16 ovlStoreMemory = 32000 threads = 16 cnsConcurrency = 8 gridOptionsCorrection=-pe threads 16 -l mem=10GB gridOptionsOverlap=-pe threads 16 -l mem=2GB gridOptionsMhap=-pe threads 16 -l mem=2GB gridOptionsConsensus=-pe threads 8 gridOptionsScript=-pe threads 16 -l mem=10GB gridOptionsFragmentCorrection=-pe threads 2 -l mem=4GB gridOptionsOverlapCorrection=-pe threads 1 -l mem=4GB
The parallel environment (-pe) varies from system to system so you may need to change threads to whatever is appropriate on your system. The memory parameter may also be different on your system.
Several steps run as large array jobs (Mhap, Overlap, Consensus) so you want to set the memory/threads parameters for those to what is available on the most common node on your system (to maximize throughput). For other jobs which run as a single multi-core job (Correction, Script) you can set them to utilize a high-memory machine available on your cluster. Make sure that whatever you request in gridOptions for Mhap/Overlap matches what you set for ovlMemory and ovlThreads, otherwise jobs may overload/underload your systems or not run at all. If you want to restrict the number of jobs running on your cluster simultaneously you can add the -tc option to the array jobs (Mhap, Overlap, Consensus) specifying the number of jobs that may be concurrently scheduled. For example, if you set -tc 20 with the above config, the run would be limited to 320 slots for Mhap (16slots per job, 20 jobs).
If you have LSF, you also must specify
as well as updating the gridOptions to the flags to the equivalents on an LSF system or
Low Coverage Assembly
By default, self-correction is performed with a probabilistic overlapping algorithm, named MHAP. The default parameters have been set to be a good compromise between speed and sensitivity. If you have low-quality data (< 85% identity) or have shorter or older chemistries, it may be beneficial to increase sensitivity. The sensitive parameters are automatically enabled below 50X coverage. You can increase sensitivity on any dataset by specifying
as a parameters to PBcR. This will increase the overlap sensitivity to over 85% and turn on PBDAGCON (if it is available).
If you have <30X coverage, you can further adjust parameters. At this coverage, there is not sufficient depth to generate >99% accuracy corrected reads. Thus, the parameters below decrease the average identity of the corrected sequences and increase the assembly error rate. You can add the following:
# lower QV of corrected sequences from 99+ to 97-98 QV=52 # increase assembly error rate asmOvlErrorRate=0.1 asmUtgErrorRate=0.06 asmCgwErrorRate=0.1 asmCnsErrorRate=0.1 asmOBT=1 asmObtErrorRate=0.08 asmObtErrorLimit=4.5 utgGraphErrorRate=0.05 utgMergeErrorRate=0.05
to your spec file if you would like to assemble <30X data. With these parameters, PBcR can close a bacterial genome (E. coli) with 20X P5 data and is comparable to hybrid tools at 20X on a eukaryotic genome (see above). These parameters for <30X assembly are experimental and have only been tried on E. coli, S. cerevisiae, and a 2.4Gbp eukaryote so your mileage may vary. PBDAGCON is also required for these coverage levels.
Controlling Consensus Module
PBcR supports two alternate consensus modules. The default is falcon_sense. An alternate is PBDAGCON, also utilized by HGAP. PBDAGCON must be installed on your system and available in your path. If you have SMRTportal installed and available in your path, you already have PBDAGCON. It can be beneficial below 60X coverage but in most cases, falcon_sense is sufficient for good assembly. PBDAGCON is automatically enabled as part of the sensitive mode above or can be enabled with
Therefore, it may be beneficial to run falcon_sense to generate an initial assembly rapidly and then run PBDAGCON if the assembly is not satisfactory. To generate an assembly using both methods for the phage example above run
% <wgs>/<Linux-amd64>/bin/PBcR -length 500 -partitions 200 -l lambda -s pacbio.spec -fastq pacbio.filtered_subreads.fastq genomeSize=50000 -noclean
The -noclean option will prevent the pipeline from removing intermediate files. Once the assembly using falcon_sense is generated, you can see its stats:
% head -n 40 lambda/9-terminator/asm.qc [Scaffolds] TotalScaffolds=1 TotalContigsInScaffolds=1 MeanContigsPerScaffold=1.00 MinContigsPerScaffold=1 MaxContigsPerScaffold=1 TotalBasesInScaffolds=49698 MeanBasesInScaffolds=49698 MinBasesInScaffolds=49698 MaxBasesInScaffolds=49698 N25ScaffoldBases=49698 N50ScaffoldBases=49698 N75ScaffoldBases=49698 TotalSpanOfScaffolds=49698 MeanSpanOfScaffolds=49698 MinScaffoldSpan=49698 MaxScaffoldSpan=49698 IntraScaffoldGaps=0 2KbScaffolds=1 2KbScaffoldSpan=49698 MeanSequenceGapLength=0 [Top5Scaffolds=contigs,size,span,avgContig,avgGap,EUID] 0=1,49698,49698,49698,0,7180000000002 total=1,49698,49698,49698,0 [Contigs] TotalContigsInScaffolds=1 TotalBasesInScaffolds=49698 TotalVarRecords=0 MeanContigLength=49698 MinContigLength=49698 MaxContigLength=49698 N25ContigBases=49698 N50ContigBases=49698 N75ContigBases=49698 [BigContigs_greater_10000]
In this case, the genome is assembled into one contigs. To try the assembly using PBDAGCON, move the old assembly
% mkdir first_asm % mv lambda* first_asm
Next, clean up the existing results
% rm templambda/[0-9]*.success % rm templambda/runPartition.sh
and finally re-launch the pipeline
% <wgs>/<Linux-amd64>/bin/PBcR -length 500 -partitions 200 -l lambda -s pacbio.spec -fastq pacbio.filtered_subreads.fastq genomeSize=50000 pbcns=1
PBcR will skip all steps before consensus and generate a new assembly.
% head -n 40 lambda/9-terminator/asm.qc -bash-3.2$ head -n 40 lambda/9-terminator/asm.qc [Scaffolds] TotalScaffolds=1 TotalContigsInScaffolds=1 MeanContigsPerScaffold=1.00 MinContigsPerScaffold=1 MaxContigsPerScaffold=1 TotalBasesInScaffolds=49508 MeanBasesInScaffolds=49508 MinBasesInScaffolds=49508 MaxBasesInScaffolds=49508 N25ScaffoldBases=49508 N50ScaffoldBases=49508 N75ScaffoldBases=49508 TotalSpanOfScaffolds=49508 MeanSpanOfScaffolds=49508 MinScaffoldSpan=49508 MaxScaffoldSpan=49508 IntraScaffoldGaps=0 2KbScaffolds=1 2KbScaffoldSpan=49508 MeanSequenceGapLength=0 [Top5Scaffolds=contigs,size,span,avgContig,avgGap,EUID] 0=1,49508,49508,49508,0,7180000000002 total=1,49508,49508,49508,0 [Contigs] TotalContigsInScaffolds=1 TotalBasesInScaffolds=49508 TotalVarRecords=0 MeanContigLength=49508 MinContigLength=49508 MaxContigLength=49508 N25ContigBases=49508 N50ContigBases=49508 N75ContigBases=49508 [BigContigs_greater_10000]
The phage is successfully assembled into a single contig by both methods.
Celera Assembler Known Issues
Correcting Large (> 100Mbp) Genomes
PBcR supports a novel overlapping module for efficiently identifying all-pairs overlaps in high-error long-read data. CA 8.3 is recommended for large genomes sequenced with PacBio data and using self-correction. If you are using hybrid correction (correcting PacBio with other data), then follow the instructions below for CA 8.1. MHAP has been tested on S. cerevisiae, A. thaliana, and D. melanogaster. To use the pipeline 50X of PacBio RS sequencing data is recommended. If you have less coverage than this, you can still use PBcR and increase the sensitivity at the expense of runtime.
PBcR will auto-detect available memory and cores if you are running on a single system. You can limit these the same way as the example spec file from the E. coli by specifying
ovlMemory = <GB per job>
More memory will speed up computation and create fewer partitions, less memory will create more partitions. When running on a grid, you should specify the memory available on each node being utilized (see above). A single machine is sufficient for medium-size eukaryotes (<300Mbp) and PBcR should complete in a few days, however, a cluster is recommended for mammalian-sized genomes. A local scratch disk is recommended for best performance on the grid.
Additionally, CA 8.2 is available on AWS with instruction and large genome examples
Correcting Large (> 100Mbp) Genomes (Using high-identity data or CA 8.1)
maxGap=1500 ovlHashBlockLength=1000000000 ovlRefBlockLength=1000000000
If you are using high-identity sequences for correction, add the parameter
blasr=-noRefineAlign -advanceHalf -noSplitSubreads -minMatch 10 -minPctIdentity 70 -bestn 24 -nCandidates 24
For self correction, add the parameter
blasr=-minReadLength 200 -maxScore -1000 -maxLCPLength 16 -bestn 24 -nCandidates 24
When using BLASR, each overlap job should take less than 8GB of memory. Thus, the limitation is usually the number of cores on the system, not the memory. You can set ovlThreads=16 and ovlConcurrency to be:
<total cores> / <ovlThreads>
Correcting Large (> 100Mbp) Genomes With CA 7.0 or older (not recommended)
ovlHashBits = 25 ovlHashBlockLength = 1000000000 ovlRefBlockSize = 100000000 doOverlapBasedTrimming = 0
This will make each overlapper job use approximately 18GB of RAM (up from the default of 6GB of ram). You should set the ovlConcurrency to be:
<total memory> / 18
For example, for a 256GB machine, ovlConcurrency would be 14. The ovlThreads parameter should be set to:
<total cores> / ovlConcurrency
So for a 256GB machine with 32 cores, ovlThreads would be 2 (assuming ovlConcurrency was set to 14 above). When using an SGE grid, the sgeOverlap parameter should also be set to request the appropriate threads and memory:
sgeOverlap = -pe threads 2 -l mem=10GB
For full details on tuning these parameters, please see the Celera Assembler runCA wiki page.
If you already have a correction running, you can tell the number of overlap jobs created by running
% wc -l temp<your library name>/0-overlaptrim-overlapper/ovlopt
If the number of overlap jobs is large (over 400), then the parameters above are recommended over the default settings.
No Overlaps Found
The PacBio SMRT-portal version 1.3.0 outputs a non-standard fastq file from their filtering pipeline. This causes the PBcR pipeline to either stop with an error or output no corrected sequences. You can check if your fastq file is the cause of the problem if it splits the sequence and quality on multiple lines:
@HWUSI-EAS109E:34:64G34AAXX:7:2:15068:1426 1:N:0:GAGTGG TCANCCCTGCATTATAAGAAGTACCGCAGGCGACG ATCTGTACATGTTCCACTTTTGCCAGAATCTCTGCC + EEE#EGGGGGIGIDIGHIIEIHIIIIIGDIIIGGFHHIIIGIIIIFID HHIGHIHHGHIHIGHGIGHGGIC @HWUSI-EAS109E:34:64G34AAXX:7:2:4347:2391 1:N:0:GAGTGG TAANGGGCTGACAACGCATTACACAAAACTCAAAC ACAACAACCGTAACCACCGCGGTTCAATGGGACTG G + EEE#EGGGGGIIIIIIHIIIIIHHIIIHIIIIIIIIDIIIGGGGGDGGG GIEIIHIDHIIHIFHEGGHIF=
versus the expected format of:
@HWUSI-EAS109E:34:64G34AAXX:7:2:15068:1426 1:N:0:GAGTGG TCANCCCTGCATTATAAGAAGTACCGCAGGCGACGATCTGTACATGTTCCACTTTTGCCAGAATCTCTGCC + EEE#EGGGGGIGIDIGHIIEIHIIIIIGDIIIGGFHHIIIGIIIIFIDHHIGHIHHGHIHIGHGIGHGGIC @HWUSI-EAS109E:34:64G34AAXX:7:2:4347:2391 1:N:0:GAGTGG TAANGGGCTGACAACGCATTACACAAAACTCAAACACAACAACCGTAACCACCGCGGTTCAATGGGACTGG + EEE#EGGGGGIIIIIIHIIIIIHHIIIHIIIIIIIIDIIIGGGGGDGGGGIEIIHIDHIIHIFHEGGHIF=
As a workaround, you can use the fasta file and convert it to fastq following the instructions above. Alternatively, you can use a perl script to convert non-standard fastq files to ones PBcR expects. Thank you to Lee Katz for providing the script. To convert a non-standard fastq file named nonstandard.fastq to the correct format run the command:
% perl run_assembly_convertMultiFastqToStandard.pl -i nonstandard.fastq -o standard.fastq
Run the script with no parameters to get help.
Berlin K, Koren S, Chin CS, Drake PJ, Landolin JM, Phillippy AM Assembling Large Genomes with Single-Molecule Sequencing and Locality Sensitive Hashing. Nature Biotechnology. (2015).
Koren S, Harhay GP, Smith TPL, Bono JL, Harhay DM, Mcvey SD, Radune D, Bergman NH, Phillippy AM. Reducing assembly complexity of microbial genomes with single-molecule sequencing. Genome Biology 14:R101 (2013)
Koren S, Schatz, M. C., Walenz, B. P., Martin, J., Howard, J. T., Ganapathy, G., Wang, Z., Rasko, D. A., McCombie, W. R., Jarvis, E. D., and Phillippy, A. M. Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nature Biotech. (2012)
If you encounter issues or have questions, please contact the authors of the pipeline, Sergey Koren (sergek AT umd.edu) or Adam M. Phillippy (aphillippy AT gmail.com)