
The sffToCA utility reads SFF files generated by 454 Life Sciences sequencers. sffToCA converts each read from the SFF format into the CA FRG format, optionally examining the read for the presence of a 'linker' sequence. If the linker sequence is found, the original read is split into two mated reads. The SFF format does not contain information about the insert size, and so this must be supplied manually during conversion.


usage: sffToCA [opts] -libraryname LIB -output NAME IN.SFF ...

  -insertsize i d        Mates are on average i +- d bp apart.

  -libraryname n         The UID of the library these reads are added to.

  -clear all             Use the whole read.
  -clear 454             Use the 454 clear ranges as is (default).
  -clear n               Use the whole read up to the first N.
  -clear pair-of-n       Use the whole read up to the first pair of Ns.
  -clear discard-n       Delete the read if there is an N in the clear range.

  If multiple -clear options are supplied, the intersection is used.  For
  'discard-n', the clear range is first computed, then if there is still an
  N in the clear range, the read is deleted.

  Caution!  Even though the default is '454', when any -clear option is used,
  the list of clear ranges to intersect is reset.  To get both '454' and 'n',
  BOTH '-clear 454' and '-clear n' must be supplied on the command line.

  -trim none             Use the whole read regardless of -clear settings.
  -trim soft             OBT and ECR can increase the clear range.
  -trim hard             OBT can only shrink the clear range, but ECR can extend (default).
  -trim chop             Erase sequence outside the clear range.

  'none' will emit the whole read, and reset clear ranges to cover the whole read.
  'soft' will emit the whole read, and leave clear ranges as set.
  'hard' is like soft, with the addition of a 'clm' message to stop OBT.
  'chop' is like none, but after the read is chopped down to just the clear bases.

  -linker [name | seq]   Search for linker, create mated reads.
                         Name is one of:
                           'titanium' == TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG and

  -nodedup               Do not remove reads that are a perfect prefix of another read.

  -output name           Write output to files prefixed with 'name'.  Three files are created:
                           name.frg   -- CA format fragments.
                           name.log   -- Actions taken; deleted fragments, mate splits, etc.
                           name.stats -- Human-readable statistics.

Every read in the input must be associated with a sequencing library. The library describes properties of its reads, for example, the expected distance between two mated reads. The mandatory -libraryname option sets the name of the library that all reads generated by sffToCA will be associated with.

The required -output parameter sets the output filename. Users should append the '.frg' extension; if it is missing, sffToCA appends it.

The required -libraryname parameter defines an identifier for the library. Generally, each run of sffToCA should use a distinct library name. The name may be any string, but a meaningful one can be helpful later, because each read stays associated with its library ID throughout the Celera Assembler (CA) pipeline. For example, reads can be dumped by library from gkpStore using 'gatekeeper'; libraries can be analyzed for randomness, comparing inter-library overlaps to intra-library overlaps, using 'overlapStore'; and scaffolds can be analyzed by library incorporation using combinations of CA output files.

Users face these four decisions when converting SFF files to FRG files:

  1. what sequence should be included in the fragment clear range? (the -clear option)
  2. what to do with sequence not in the clear range? (the -trim option)
  3. what to do about duplicate sequences? (the -nodedup option)
  4. what linker is separating mated reads? (the -insertsize and -linker options)

Clear Ranges and Trimming

sffToCA can determine the initial clear range of each read in a variety of ways. We have experimented with the five options ('all', '454', 'n', 'pair-of-n', and 'discard-n') and decided that '454' is best. If you have LOTS of sequence data, 'discard-n' will remove many of the questionable and/or low quality reads, which might improve assemblies. If reads with N are left in the data set, the Overlap Based Trimming module will remove any low quality ends.

The two recent 454 technologies, FLX and Titanium, generate reads with significantly different characteristics. FLX reads tend to be good over the entire read (clear 'all' can be used), and only a few reads have an N in them and these reads are usually problematic ('discard-n' can be used). Titanium reads are much longer, the 3' end of the read is of much lower quality (clear '454' should be used), and there are many more N's in the reads ('discard-n' should NOT be used).
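The way multiple -clear options combine can be illustrated with a small Python sketch. It is not sffToCA's code: the function names, the 0-based half-open coordinates, and the stand-in vendor clear range are assumptions made for illustration. It follows the rule from the usage text that the ranges are intersected first and 'discard-n' is applied to the result.

```python
from typing import List, Optional, Tuple

# Half-open [begin, end), 0-based. This convention is an assumption of the sketch.
ClearRange = Tuple[int, int]


def clear_all(seq: str) -> ClearRange:
    """'all': use the whole read."""
    return (0, len(seq))


def clear_n(seq: str) -> ClearRange:
    """'n': use the whole read up to the first N."""
    pos = seq.upper().find("N")
    return (0, len(seq) if pos < 0 else pos)


def clear_pair_of_n(seq: str) -> ClearRange:
    """'pair-of-n': use the whole read up to the first pair of Ns."""
    pos = seq.upper().find("NN")
    return (0, len(seq) if pos < 0 else pos)


def intersect(ranges: List[ClearRange]) -> ClearRange:
    """Intersect clear ranges; an empty intersection collapses to (b, b)."""
    begin = max(b for b, _ in ranges)
    end = min(e for _, e in ranges)
    return (begin, max(begin, end))


def final_clear(seq: str, ranges: List[ClearRange],
                discard_n: bool = False) -> Optional[ClearRange]:
    """Intersect the requested ranges, then apply 'discard-n' to the result.

    Returns None when the read would be deleted.
    """
    begin, end = intersect(ranges)
    if discard_n and "N" in seq[begin:end].upper():
        return None
    return (begin, end)


read = "ACGTACGTNACGTNNACGT"
vendor_clear = (2, 16)  # hypothetical stand-in for the 454 clear range in the SFF

# Equivalent in spirit to "-clear 454 -clear n".
print(final_clear(read, [vendor_clear, clear_n(read)]))
# Equivalent in spirit to "-clear 454 -clear pair-of-n -clear discard-n".
print(final_clear(read, [vendor_clear, clear_pair_of_n(read)], discard_n=True))
```

In the second call the intersected range still contains a single N, so the read is reported as deleted.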

Once the clear range is determined, there are several choices of what to do with it. The clear range can be completely ignored ('none'), it can be used but allowed to change ('soft'), it can be used but not allowed to change ('hard'), or it can be used to immediately trim bases from the read ('chop').
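As a rough illustration of the four -trim modes, the Python sketch below models what each mode emits for one read. The function name, the tuple layout, and the boolean standing in for the 'clm' message are all hypothetical; the behavior is paraphrased from the usage text, not taken from the sffToCA source.

```python
from typing import Tuple


def apply_trim(seq: str, clear: Tuple[int, int], mode: str):
    """Return (emitted_sequence, emitted_clear_range, obt_locked).

    clear is a 0-based half-open range (an assumption of this sketch).
    obt_locked models the 'clm' message that 'hard' adds to stop OBT.
    """
    begin, end = clear
    if mode == "none":
        # Emit the whole read and reset the clear range to cover it.
        return seq, (0, len(seq)), False
    if mode == "soft":
        # Emit the whole read and leave the clear range as set.
        return seq, (begin, end), False
    if mode == "hard":
        # Like 'soft', plus the marker that stops OBT.
        return seq, (begin, end), True
    if mode == "chop":
        # Erase sequence outside the clear range; what is left is all clear.
        chopped = seq[begin:end]
        return chopped, (0, len(chopped)), False
    raise ValueError("unknown trim mode: " + mode)


for mode in ("none", "soft", "hard", "chop"):
    print(mode, apply_trim("AACCGGTTAACC", (2, 10), mode))
```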

At present, we suggest using "-clear 454 -trim chop".

NOTE! For mate-pair reads, the use of "-trim chop" is STRONGLY SUGGESTED.


Duplicate reads are a common problem with 454 technology. Duplicates can be harmful for assembly. Therefore, by default, the SFF conversion utility will remove duplicate reads. The algorithm removes any read that is a perfect prefix of some other read. For example, it would remove "ACCT" if it found another read, "ACCTGGCT". The other read can be from any of the SFF files provided on this run of the utility.
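The prefix rule can be sketched in a few lines of Python. This is an illustration under assumptions, not the sffToCA implementation: the function name is made up, reads are held in a plain dictionary, and identical reads are treated as prefixes of each other so that exactly one copy survives. The sketch relies on the fact that, after sorting, any read that is a prefix of another read is immediately followed by a read that starts with it.

```python
from typing import Dict, List, Tuple


def remove_prefix_duplicates(reads: Dict[str, str]) -> Tuple[Dict[str, str], List[str]]:
    """Drop every read whose sequence is a perfect prefix of another read.

    reads maps a read name to its sequence. Returns (kept_reads, deleted_names).
    """
    # Sort by sequence, then by name so the result is deterministic.
    ordered = sorted(reads.items(), key=lambda item: (item[1], item[0]))
    kept: Dict[str, str] = {}
    deleted: List[str] = []
    for index, (name, seq) in enumerate(ordered):
        has_next = index + 1 < len(ordered)
        if has_next and ordered[index + 1][1].startswith(seq):
            deleted.append(name)
        else:
            kept[name] = seq
    return kept, deleted


reads = {"r1": "ACCT", "r2": "ACCTGGCT", "r3": "ACCTGGCT", "r4": "TTGA"}
kept, deleted = remove_prefix_duplicates(reads)
print(sorted(kept))  # names of surviving reads
print(deleted)       # names of removed reads
```

Here "ACCT" is removed because "ACCTGGCT" exists, and one of the two identical "ACCTGGCT" reads is removed as well.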

Use the -nodedup option if the data was already processed to remove duplicate reads. There are 3rd party tools for this, including:

  1. The Schmidt Lab utility
  2. cdhit-454.

Apart from sffToCA, Celera Assembler (CA) has its own duplicate removal algorithm. CA will use overlaps to detect and remove reads that are a near-perfect prefix of any other read in the same library (as determined by the library tag supplied to sffToCA). CA will also use overlaps to detect and remove mate pairs whose reads form perfect prefixes of others, starting at the end of the read that was ligated to linker during library construction. Thus, even if the sffToCA -nodedup option is used, some duplicates may be removed by Celera Assembler.

One cause of duplicates is failure of the emulsion PCR step to match one sequence to one bead. Since this happens during library construction before sequencing, it affects all subsequent sequencing runs. Duplicates from this mechanism are spread across all SFF files derived from the same library. Thus, users should invoke sffToCA once for each library, giving all the SFF files from that library on each invocation. This will let sffToCA find duplicates across SFF files. During assembly, duplicates of this type violate the assumption of uniform genome coverage. By "piling up", they can induce high coverage on unique unitigs. The assembler may distrust these unitigs, suspecting they are collapsed repeats. Thus, the effect of duplicate reads is missed opportunity and shattered assembly.

Another cause of duplicates is insufficient starting material. This is commonly the case for large-insert libraries, such as 20Kbp paired ends. With insufficient starting material, PCR copies of the same target are sequenced multiple times. This leads to duplicate mates whose read lengths differ but whose linker-ligation points match exactly. During assembly, duplicates of this type violate the assumption of independence of mate pair evidence. The assembler will "trust" each mate pair because it seems to be confirmed by others. Thus, the danger of duplicate mates is a biased result that satisfies duplicated mate constraints more often than non-duplicated mates. Duplicate mates can also induce the same high coverage problems noted above.

Linker Sequences

Unlike Sanger and Illumina reads, mate-pair reads from 454 appear in the output as a single sequence. The two mated reads are separated by a known 'linker' sequence. To discover the mated reads, the linker sequence must be bioinformatically removed. sffToCA uses a Smith-Waterman dynamic programming alignment to match the known linker sequence(s) to each read.

  • For each read with alignment to linker, if there is exactly one high-quality alignment to that linker, the original SFF read is split into two mated reads; the two shorter reads are output as separate "FRG" messages plus an "LKG" message that links them.
  • For each read with partial alignment to linker near the read end, the output FRG message has a reduced clear range that excludes the aligned region.
  • For each read with low-quality alignment to linker, the output FRG message carries an attribute that marks the aligned region as suspected contaminant. Contamination regions are scrutinized during the chimer detection and overlap-based trimming steps of Overlap Based Trimming.
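The cases above can be made concrete with a deliberately simplified Python sketch. It swaps the Smith-Waterman alignment for exact substring matching, so it cannot model low-quality alignments or sequencing errors inside the linker. The function name, the result labels, and the min_partial threshold are hypothetical, and the sketch makes no claim about how sffToCA orients the two halves or handles reads with more than one linker hit.

```python
from typing import Optional, Tuple

FLX_LINKER = "GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC"


def classify_read(seq: str, linker: str,
                  min_partial: int = 6) -> Tuple[str, Optional[str], Optional[str]]:
    """Classify a read by exact linker matches (a stand-in for real alignment).

    Returns (label, first_part, second_part), where label is one of
    'split', 'multiple', 'partial', or 'none'.
    """
    # Collect every exact occurrence of the linker.
    hits = []
    start = seq.find(linker)
    while start >= 0:
        hits.append(start)
        start = seq.find(linker, start + 1)

    if len(hits) == 1:
        # Exactly one hit: the flanking sequence becomes two mated reads.
        left = seq[:hits[0]]
        right = seq[hits[0] + len(linker):]
        return ("split", left, right)
    if len(hits) > 1:
        return ("multiple", None, None)

    # No full hit: look for the start of the linker at the end of the read
    # and shrink the usable range to exclude it.
    for k in range(len(linker) - 1, min_partial - 1, -1):
        if seq.endswith(linker[:k]):
            return ("partial", seq[:len(seq) - k], None)
    return ("none", seq, None)


print(classify_read("ACGTACGTAC" + FLX_LINKER + "TTGACCAGTA", FLX_LINKER))
print(classify_read("ACGTACGTAC" + FLX_LINKER[:10], FLX_LINKER))
print(classify_read("ACGTACGTACGT", FLX_LINKER))
```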

The built-in linker sequences are:

  1. -linker flx -- GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC, a palindrome, equal to its own reverse complement.

If any -linker option is supplied, the -insertsize must also be supplied.

Multiple -linker options are allowed. For example, if you don't know whether your fragments use the flx or titanium linker, request a search for both with "-linker flx -linker titanium".

Some labs use non-standard linker sequences. sffToCA can search for up to 50 custom linkers, each provided as a separate -linker <sequence> option. If different, supply both the forward and reverse complement as if they were separate linkers. (CAUTION: Custom linker support was enhanced April 15, 2010. Use of 50 linkers has never actually been tested. Alignment thresholds may not be well-tuned for custom linkers much longer or shorter than 42bp.)
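When preparing custom linkers, a short Python helper can decide whether the reverse complement needs to be supplied as a second -linker option. This is a sketch: the helper names are made up, and it only builds an argument list for the -linker option described above without running sffToCA. The two sequences are the ones quoted on this page and serve only as test input.

```python
from typing import List

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def linker_args(seq: str) -> List[str]:
    """Build -linker arguments for one custom linker.

    A palindromic linker equals its own reverse complement and is passed once;
    otherwise the forward and reverse-complement forms are passed separately.
    """
    forward = seq.upper()
    reverse = reverse_complement(forward)
    variants = [forward] if reverse == forward else [forward, reverse]
    args: List[str] = []
    for variant in variants:
        args += ["-linker", variant]
    return args


PALINDROMIC = "GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC"
NON_PALINDROMIC = "TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG"

print(linker_args(PALINDROMIC))      # one -linker option
print(linker_args(NON_PALINDROMIC))  # two -linker options
```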


  • sffToCA can process gzip (.gz) and bzip2 (.bz2) compressed SFF files.


Convert one full FLX run of mated reads to FRG format.

sffToCA \
  -libraryname FJRUAFO \
  -insertsize 3200 900 \
  -linker flx \
  -trim chop \
  -output FJRUAFO.frg \
  FJRUAFO01.sff.bz2 \

sffToCA generated output files 'FJRUAFO.frg' and 'FJRUAFO.stats'. The stats file details how many duplicates were found, how many mate pairs were found, etc.

input sff               FJRUAFO01.sff
input sff               FJRUAFO02.sff
output fragments        FJRUAFO.frg
clear range             454
trimming                chop

numReadsInSFF           494837

too short               550
ok                      494287
trimmed by N            0
too long                0

not examined            12496
none detected           255637
inconsistent            10290
partial                 75538
good                    140876

fragment                331175
mate pair               140876
deleted inconsistent    10290
deleted duplicate       11946
deleted too short       550
deleted N not allowed   0

Please see this bug report about the number of linker sequences reported in the stats file.
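The counts in the example stats file can be cross-checked with a few lines of Python. In this example the length categories, the linker categories, and the output categories each sum to numReadsInSFF. Whether sffToCA guarantees these identities in general is not stated here, so treat the check as a sanity test for this output only. 'trimmed by N' is zero in the example, so the sketch leaves it out of the length sum rather than guess how it is counted. The parsing helper is hypothetical and simply takes the last whitespace-separated token of each line as the value.

```python
from typing import Dict

EXAMPLE_STATS = """\
numReadsInSFF           494837
too short               550
ok                      494287
trimmed by N            0
too long                0
not examined            12496
none detected           255637
inconsistent            10290
partial                 75538
good                    140876
fragment                331175
mate pair               140876
deleted inconsistent    10290
deleted duplicate       11946
deleted too short       550
deleted N not allowed   0
"""


def parse_counts(text: str) -> Dict[str, int]:
    """Collect 'label   integer' lines into a dictionary; skip everything else."""
    counts: Dict[str, int] = {}
    for line in text.splitlines():
        parts = line.rsplit(None, 1)
        if len(parts) == 2 and parts[1].isdigit():
            counts[parts[0].strip()] = int(parts[1])
    return counts


def check_totals(counts: Dict[str, int]) -> Dict[str, bool]:
    """Compare three groups of categories against numReadsInSFF."""
    total = counts["numReadsInSFF"]
    by_length = counts["too short"] + counts["ok"] + counts["too long"]
    by_linker = sum(counts[k] for k in (
        "not examined", "none detected", "inconsistent", "partial", "good"))
    by_output = sum(counts[k] for k in (
        "fragment", "mate pair", "deleted inconsistent",
        "deleted duplicate", "deleted too short", "deleted N not allowed"))
    return {
        "length": by_length == total,
        "linker": by_linker == total,
        "output": by_output == total,
    }


print(check_totals(parse_counts(EXAMPLE_STATS)))
```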