PBcR 81

From wgs-assembler
Jump to: navigation, search


The PBcR pipeline is available in CA 7.0 or newer, however CA 8.1 is recommended. The self-correction feature, requires Celera Assembler 8.1.

usage:pacBioToCA [options] -s spec.file fastqFile=fastqfile [frg]
  -length        Minimum length of PacBio RS fragment to keep.
  -partitions    Number of partitions for consensus.
  -l libraryname Name of the library; freeformat text.
  -t threads     Number of threads to use for correction.
  -noclean       Do not clean up after the correction finishes. By default, the temp<libraryname> directory will be removed on successful completion of the correction.
  -sge           Optional parameters to pass through to qsub for SGE. If your spec file specifies an sge line, it will be used by default.
  -sgeCorrection Optional parameters to pass through when running the correction step to qsub for SGE. If your spec file specifies an sgeScript line, it will be used by default.

The pipeline will corrected the sequences and output a CA FRG file consisting of a single LIB message to <libraryname>.frg. It will also output the corrected fasta, qual, and fastq sequences (named <libraryname>.fasta, <libraryname>.qual, and <libraryname>.fastq).

In addition, the pipeline will also output <libraryname>.log file which indicates the source for each corrected sequence. The format of the file is:

% head <libraryname>.log
INPUT_NAME	                                                                OUTPUT_NAME	  SUBREAD	 START      END	       LENGTH
m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/8962/3861_17595	pacbio_3307_1	    1	          53	   13682	13629
m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/8965/6675_14278	pacbio_3309_1	    1	          51	    7551	 7500
m130404_014004_sidney_c100506902550000001823076808221337_s1_p0/8977/0_17507	pacbio_3316_1	    1	          63	   17456	17393

The first column is the input name of the PacBio RS sequencing read which was corrected. The second column is the name of the output corrected sequence. The third column indicates if the input PacBio RS sequence was split into more than a single fragment. This can happen due to SMRTbell adapters or issues correcting a subset of the long sequence. The next columns indicate the positions in the original PacBio RS sequence where the corrected read came from. The final column is the length of the corrected sequence. The <libraryname>.correction.* and <libraryname>.err files may also get created based on whether the pipeline runs on the grid or a single machine. These provide debugging output. If you encounter issues with the pipeline or have questions, please include these *.err files with your email.

The -libraryname option is mandatory. It provides a UID name for the library that the reads are placed into. This name is completely up to the end user, but must contain no white spaces or commas.

The -s pacbio.spec option is mandatory. It should point to a a CA spec file to specify how to overlap the illumina and PacBio data for correction (which comes included with the sample data.) Two templates are provided for SGE and high-memory multi-core environment.

The fastqFile option is mandatory. It specifies the pacbio fragments to correct. The correction will also work with unfiltered reads if you don't have a filtered fastq file.

The options (-length, -partitions, and -t, -noclean, -sge, and -sgeCorrection) are optional.

The -t X option specifies that the correction program itself is run using X threads. It should be set to the number of processors available wherever this pipeline runs.

You can specify an arbitrary number of FRG files to be used as the trusted sequences for correction. If you have 50X+ coverage of C2 sequencing data (or newer) you can also perform self-correction. This option is automatically enabled if you omit the FRG files. Input FRG files can be any format CA accepts. Please see the FastqToCA and SffToCA pages for information on inputting Illumina or 454 data. Generally, you don't need more than about 50X of high-identity sequences for correction.

Inputting PacBio RS Sequences

The PBcR pipeline expects PacBio RS sequences in fastq format (with sanger (PHRED32) quality values). The pbh5tools package from Pacific Biosciences converts h5 files to fasta or fastq files. The currently recommended parameters for correction are quality = 0.75 and minimum length = 100.

If you already have fasta files, you can download a Java utility to convert the fasta files into PBcR compatible fastq files:

java -jar convertFastaAndQualToFastq.jar myfile.fasta > myfile.fastq

NOTE: If you are using the SMRT-portal version 1.3, please see the known issues if you encounter errors or no corrected sequences are output.

Self-Correction With C2 Sequences (or newer)

Starting with the C2 chemistry (and the more recent chemistries), you can self-correct PacBio RS sequences without any high-identity technology. Using this pipeline requires Celera Assembler 8.1. You must also have SMRTportal installed and available in your path. SMRTportal is freely available from the PacBio RS DevNet website. Approximately 100X (or more) of C2 sequencing is ideal. If you have less than 50X PacBio RS sequencing, then follow the instructions for correction with complementary high-identity sequences.

To run the pipeline, follow the usage instructions below.

Self-correction Example with E. coli Reads

This example is applicable for sufficient coverage of C2 or newer sequencing chemistry sequences. In this case, self-correction can be used, eliminating the requirement for complementary high-accuracy short-read sequences. Download the sample data from http://www.cbcb.umd.edu/software/PBcR/data/selfSampleData.tar.gz which contains the Pacific Biosciences release of 100x of P4-C2 chemistry reads after filtering with SMRTportal. You can download the original data directly (6.35 GB).

% tar xvzf selfSampleData.tar.gz
x selfSampleData/
x selfSampleData/asm.spec
x selfSampleData/ecoli.fasta
x selfSampleData/pacbio_filtered.fastq.bz2
x selfSampleData/pacbio.spec
x selfSampleData/README
x selfSampleData/asm.SGE.spec
x selfSampleData/pacbio.SGE.spec

% ls selfSampleData/
README                         instructions on running the correction
asm.spec                       CA spec file for assembly. Assumes a high-memory multi-core machine. 
asm.SGE.spec                   CA spec file for assembly. Assumes an SGE environment.
ecoli.fasta                    The reference E. coli K-12 MG1655 sequence
pacbio_filtered.fastq.bz2      The PacBio RS sequences (with high-error) after filtering through the secondary pipeline (approximately 100X coverage).
pacbio.spec                    CA spec file to be used for correction. Assumes a high-memory multi-core machine.
pacbio.SGE.spec                CA spec file to be used for correction. Assumes an SGE environment.

Correct PacBio RS single-pass sequences

Run the correction to generate corrected pacbio sequences.

% <wgs>/<Linux-amd64>/bin/pacBioToCA -pbCNS -length 500 -partitions 200 -l pacbio -t 16 -s pacbio.spec -fastq pacbio_filtered.fastq.bz2 genomeSize=4650000 > run.out 2>&1

Assemble corrected sequences

After the correction pipeline runs, you will have a pacbio.frg file in the selfSampleData directory. There is also be pacbio.fasta and a pacbio.qual files. You can use the pacbio.frg file for a CA assembly. Note, if you have high-coverage PacBio RS data, we recommend using only 25X of the longest post-correction sequences for assembly for best results. First, we can estimate the coverage and average corrected read size

% cat pacbio.log |sort -nk6 |awk '{SUM+=$NF; TOTAL++; } END {print SUM/4650000" "SUM/TOTAL}'

The average read size should be approximately 8,100bp and 38X coverage. This estimate is pre-quality trimming so the actual corrected read length and coverage will be slightly shorter (80-90% of estimate). Since we have over 25X, we will select the longest 25X. Given a genome size of 4.65Mbp, we need 116Mbp in PBcR sequences.

% <wgs>/<Linux-amd64>/bin/gatekeeper -T -F -o asm.gkpStore pacbio.frg
% <wgs>/<Linux-amd64>/bin/gatekeeper -dumpfrg -longestlength 0 116000000 asm.gkpStore > pacbio.25X.frg

Finally, we will assemble the longest 25X of corrected data. The provided asm.spec includes the recommended parameters for assembling PacBio data. More details are available in the assembly of corrected sequences section. Since we have long sequences (over 6.5Kbp average), we can use a large minimum overlap length. We will also include options to speed up the consensus step. This will use only uncontained sequences to calculate consensus. While slightly lower quality (99.96% identity to the reference versus 99.99%), it runs in 4 hours versus 18 (4X faster). After assembly, polishing with Quiver is recommended to get the final consensus sequence.

% <wgs>/<Linux-amd64>/bin/runCA -p asm -d asm -s asm.spec ovlMinLen=3000 cnsReuseUnitigs=1 "cnsReduceUnitigs=0 0" pacbio.25X.frg > asm.out 2>&1

Substitute your CA location for <wgs> and your machine type for <Linux-amd64>. After this you should have an assembled genome in asm/9-terminator/asm.ctg.fasta.

% head -n 50 asm/9-terminator/asm.qc






The assembly is comprised of one long contig (the full E. coli chromosome) with a 2.5Kbp overlap since the chromosome is circular. We can check the number of sequences within each contig:

% cat asm/9-terminator/asm.posmap.frgctg |awk '{print $2}' |sort |uniq -c
  13386 7180000000002
      5 7180000000003

The first column gives the sequence count in each contig. Generally, contigs comprised of a few sequences (less than 50) can safely be ignored in downstream analysis.

Correction Using Complementary High-Identity Sequences

Inputting High-Identity Sequences

The below example is applicable if you have less than 50X coverage of PacBio RS data or you have C1 or pre-release C2 chemistry. In this case, high-identity sequences are required to correct PacBio RS sequence. You can correct the PacBio RS sequences using any high-identity next-generation technology. This includes Illumina, 454, and PacBio CCS sequences. The PBcR pipeline accepts an arbitrary number of frg files at the end of the command.

  • For Illumina data, use the fastqToCA command. For examples, see the phage example below and the usage page.
  • For 454 data, use sffToCA.
  • For PacBio RS CCS sequences, use the fastqToCA command. For examples, see the usage page.

Download the phage sample data available http://www.cbcb.umd.edu/software/PBcR/data/sampleData.tar.gz

% tar xvzf sampleData.tar.gz
x sampleData/
x sampleData/asm.spec
x sampleData/illumina.fastq
x sampleData/lambda.fasta
x sampleData/pacbio.filtered_subreads.fasta
x sampleData/pacbio.spec
x sampleData/README
x sampleData/asm.SGE.spec
x sampleData/pacbio.SGE.spec

% ls sampleData/
README                         instructions on running the correction
asm.spec                       CA spec file for assembly. Assumes a high-memory multi-core machine. 
asm.SGE.spec                   CA spec file for assembly. Assumes an SGE environment.
illumina.fastq                 Illumina unpaired sequences to be used for correction
lambda.fasta                   The reference phage sequence
pacbio.filtered_subreads.fasta The PacBio RS sequences (with high-error) after filtering through the secondary pipeline.
pacbio.spec                    CA spec file to be used for correction. Assumes a high-memory multi-core machine.
pacbio.SGE.spec                CA spec file to be used for correction. Assumes an SGE environment.

Processing complementary data

First, convert illumina.fastq to CA frg format:

% cd sampleData/
% <wgs>/<Linux-amd64>/bin/fastqToCA -libraryname illumina -technology illumina -type sanger -innie -reads illumina.fastq > illumina.frg

Correct PacBio RS single-pass sequences

Next, run the correction to generate corrected pacbio sequences. This example uses the Java utility above to convert the PacBio RS fasta file to fastq.

% java -jar convertFastaAndQualToFastq.jar pacbio.filtered_subreads.fasta > pacbio.filtered_subreads.fastq
% <wgs>/<Linux-amd64>/bin/pacBioToCA -length 500 -partitions 200 -l pacbio -t 16 -s pacbio.spec -fastq pacbio.filtered_subreads.fastq illumina.frg > run.out 2>&1

Assemble corrected sequences

After the correction pipeline runs, you will have a pacbio.frg file in the sampleData directory. There is also be pacbio.fasta and a pacbio.qual files. You can use the pacbio.frg file for a CA assembly. Note, if you have high-coverage PacBio RS data (over 50X), we recommend using 25X of the longest post-correction sequences for assembly for best results.

To subset the longest 25X of sequences (only supported when CA is built from source). Given a genome size of 50Kbp, we need 1Mbp in PBcR sequences.

% <wgs>/<Linux-amd64>/bin/gatekeeper -T -F -o asm.gkpStore  pacbio.frg 
% <wgs>/<Linux-amd64>/bin/gatekeeper -dumpfrg -longestlength 0 1000000 asm.gkpStore > pacbio.25X.frg

Assemble the data

% <wgs>/<Linux-amd64>/bin/runCA -p asm -d asm -s asm.spec pacbio.25X.frg > asm.out 2>&1

Substitute your CA location for <wgs> and your machine type for <Linux-amd64>. After this you should have an assembled phage in asm/9-terminator/asm.ctg.fasta.

% head -n 40 asm/9-terminator/asm.qc






Illumina-only assembly comparison

For comparison, you can assemble the Illumina data alone:

% <wgs>/<Linux-amd64>/bin/runCA -p asm -d illuminaASM -s asm.spec illumina.frg > asm.out 2>&1
% head -n 40 illuminaASM/9-terminator/asm.qc





Assembly of Corrected Sequences

Celera Assembler Parameters

Celera Assembler, like many OLC assemblers, is sensitive to high coverage and noise, especially near the ends of sequences. To remove some of this noise, we recommend using the longest 25X of sequence to assemble. Given an E. coli K12 genome (4.65Mbp genome size), 25X coverage is approximately 116MBp. The commands below will subset an frg file to only 25X of the longest sequences.

% <wgs>/<Linux-amd64>/bin/gatekeeper -T -F -o asm.gkpStore pacbio.frg
% <wgs>/<Linux-amd64>/bin/gatekeeper -dumpfrg -longestlength 0 116000000 asm.gkpStore > pacbio.25X.frg

Finally, it is recommended that some Celera Assembler parameters be adjusted to make overlap detection more stringent and eliminate false overlaps when assembling only PacBio corrected long sequences. The <ovl value> parameter should be set to approximately 40% of your average corrected sequence lengths. As a general rule, if the average corrected length is less than 2.5Kbp, set it to 1000, if it is less than 3Kbp, set it to 1500, if it is less than 5.5Kbp, set it to 2000, if it is greater than 5.5Kbp, set it to 2500, and if it is greater than 6.5Kbp, set it to 3000. The currently recommended parameters are

% <wgs>/<Linux-amd64>/bin/runCA unitigger=bogart merSize=14 ovlMinLen=<ovl value> \
                           obtErrorRate=0.03 obtErrorLimit=4.5 ovlErrorRate=0.03 \
                           utgErrorRate=0.015 utgGraphErrorRate=0.015 utgGraphErrorLimit=0 \
                           utgMergeErrorRate=0.03 utgMergeErrorLimit=0

It may also be beneficial to run the assembly with default parameters (which also allow combining PacBio corrected sequences with other technologies):

% <wgs>/<Linux-amd64>/bin/runCA unitigger=bogart merSize=14 

Finally, if you are working with a diploid genome and would like to merge haplotypes, you can decrease the overlap stringency to allow the assembler to merge more regions:

% <wgs>/<Linux-amd64>/bin/runCA unitigger=bogart merSize=14 \
                          utgGraphErrorRate=0.07 utgGraphErrorLimit=3.25 \
                          utgMergeErrorRate=0.085 utgMergeErrorLimit=5.25 doOverlapBasedTrimming=0

Polishing Consensus

The final assembly consensus is normally QV45 of higher. It can be improved by post-processing using the Quiver consenus algorithm (the default for the RS_Resequencing protocol in SMRT®Portal 1.4). Import the assembly into the SMRTPortal and run the RS_Resequencing protocol to generate a final polished assembly.

Additional Options

Replacing Celera Assembler Overlapper

Instead of using CA's built-in overlapper, it is possible to use BLASR, Bowtie 2, or an arbitrary aligner to find overlaps between PacBio RS sequences and high-identity sequences in CA 8.1 or when CA is built from source. Note that BLASR is automatically enabled if you are using self-correction.

To enable BLASR add:

"blasr=-advanceHalf -noRefineAlign -ignoreQuality -minMatch 10 -maxLCPLength 15 -minPctIdentity 70.0"

to the spec file or as a parameter.

To enable Bowtie 2, specify

"bowtie = -N 0 -L 10 --rfg 2,2 --rdg 2,2 -R 3 -i S,1,1.25"

to the spec file or as a parameter.

To use an arbitrary aligner, generate alignments with PacBio RS sequences as the reference and high-identity sequences as the reads in SAM format. Then, given sam files named 001.sam, 002.sam, 003.sam,

% find -L `pwd` -name \*.sam -print | sort > fofn
% echo "samFOFN=fofn" >> pacbio.spec

Filling Coverage Gaps With PacBio RS Sequences

In older versions of PBcR, when there was a coverage gap in the short-read coverage, the PacBio RS sequence was split. This caused issues when the short-read data had coverage gaps due to sequencing bias. In the latest version available in CA 8.0 or newer can be used to recruit PacBio RS sequences to coverage gaps. The corrected sequences still have a high-quality consensus above 99% accuracy. On a test run of E. coli K-12, enabling the option increased the mean corrected sequence length to 4,187 from 2,493 and improved the assembly N50 to 4.65Mbp from 3.32Mbp. To enable this option specify


in the spec file or as a parameter.

Known Issues

Celera Assembler Known Issues

For the latest issues found in the Celera Assembler release, please see known issues page. You can also check the Celera Assembler bugs page for any reported issues.

Correcting Large (> 100Mbp) Genomes

PBcR and Celera Assembler require adjusted parameters for optimal performance on large genomes. The default parameters in the pacbio.spec file are configured to fully utilize a machine with 16 cores and 48GB of RAM. For large genomes (i.e. 1GB), an SGE grid or a high-memory machine (256GB) of RAM are recommended for optimal performance.

Correcting Large (> 100Mbp) Genomes With CA 8.1

PBcR has beta support for using BLASR when correcting large genomes starting in CA 8.1 .


If you are using high-identity sequences for correction, add the parameter

blasr=-noRefineAlign -advanceHalf -noSplitSubreads -minMatch 10 -minPctIdentity 70 -bestn 24 -nCandidates 24

For self correction, add the parameter

blasr=-minReadLength 200 -maxScore -1000 -maxLCPLength 16 -bestn 24 -nCandidates 24


When using BLASR, each overlap job should take less than 8GB of memory. Thus, the limitation is usually the number of cores on the system, not the memory. You can set ovlThreads=16 and ovlConcurrency to be:

<total cores> / <ovlThreads>

Correcting Large (> 100Mbp) Genomes With CA 7.0 or older

ovlHashBits = 25
ovlHashBlockLength = 1000000000
ovlRefBlockSize =  100000000
doOverlapBasedTrimming = 0

This will make each overlapper job use approximately 18GB of RAM (up from the default of 6GB of ram). You should set the ovlConcurrency to be:

<total memory> / 18

For example, for a 256GB machine, ovlConcurrency would be 14. The ovlThreads parameter should be set to:

<total cores> / ovlConcurrency

So for a 256GB machine with 32 cores, ovlThreads would be 2 (assuming ovlConcurrency was set to 14 above). When using an SGE grid, the sgeOverlap parameter should also be set to request the appropriate threads and memory:

sgeOverlap = -pe threads 2 -l mem=10GB

For full details on tuning these parameters, please see the Celera Assembler runCA wiki page.

If you already have a correction running, you can tell the number of overlap jobs created by running

% wc -l temp<your library name>/0-overlaptrim-overlapper/ovlopt

If the number of overlap jobs is large (over 400), then the parameters above are recommended over the default settings.

Short-read sequences below 64bp

If you have short sequences for correction (< 64bp), the pipeline will by default not load them and will give a no-overlaps error instead of outputting corrected sequences. In order to use short Illumina sequences, you have to specify the
option to the pipeline. Also, add the following parameters at the end of the pacbio.spec file
frgMinLen = 40 # smaller than your read length
ovlMinLen = 30 # about 75% of your frgMinLen
merSize = 10

This is currently only recommended for small (<10Mbp) genomes as it will increase runtime.

If you have Illumina sequences longer than 64bp but shorter than 100bp, have to specify the
option to the pipeline along with the following parameter to pacbio.spec:
merSize = 10

No Overlaps Found

The PacBio SMRT-portal version 1.3.0 outputs a non-standard fastq file from their filtering pipeline. This causes the PBcR pipeline to either stop with an error or output no corrected sequences. You can check if your fastq file is the cause of the problem if it splits the sequence and quality on multiple lines:

@HWUSI-EAS109E:34:64G34AAXX:7:2:15068:1426 1:N:0:GAGTGG
@HWUSI-EAS109E:34:64G34AAXX:7:2:4347:2391 1:N:0:GAGTGG

versus the expected format of:

@HWUSI-EAS109E:34:64G34AAXX:7:2:15068:1426 1:N:0:GAGTGG
@HWUSI-EAS109E:34:64G34AAXX:7:2:4347:2391 1:N:0:GAGTGG

As a workaround, you can use the fasta file and convert it to fastq following the instructions above. Alternatively, you can use a perl script to convert non-standard fastq files to ones PBcR expects. Thank you to Lee Katz for providing the script. To convert a non-standard fastq file named nonstandard.fastq to the correct format run the command:

% perl run_assembly_convertMultiFastqToStandard.pl -i nonstandard.fastq -o standard.fastq

Run the script with no parameters to get help.