These are the requirements for successful use of Celera Assembler.

User Requirements

  • Working knowledge of the Unix operating system.
  • Understanding of DNA sequence analysis by whole-genome-shotgun sequencing.

Scientific Applicability

  • Celera Assembler expects paired-end, whole-genome shotgun (WGS) sequence from genomic DNA.
    • Celera Assembler can also run on metagenomics shotgun DNA sequence.
    • With special options, Celera Assembler can also run on finishing reads mixed with shotgun reads.
    • Celera Assembler recognizes and handles subsets of reads with markedly different coverage. This includes high-copy organelle DNA (mitochondria, chloroplasts) in eukaryotes and high-copy plasmids in prokaryotes. We advise against trying to screen such 'contaminants' from the WGS data.
  • Celera Assembler expects data from high-throughput sequencing machines.
    • Celera Assembler can run on DNA sequence from Sanger sequencers such as the ABI PRISM 3730xl.
    • Celera Assembler can run on DNA sequence from 454 sequencers, including the FLX Standard and FLX Titanium. CA cannot run directly on the short reads from the 454 GS 20. (GS 20 reads can be pre-assembled with Newbler and fed to CA as shredded contigs.)
    • Celera Assembler can run on DNA sequence from the Illumina Solexa sequencers. This support started with CA 6.0 and covers long reads only (75 bp and longer).
  • Celera Assembler DOES NOT support:
    • Data that entirely lacks paired ends. (This is sometimes called fragment data or shotgun data.)
    • Reads shorter than 64 bp. (For a good assembly, 100 bp is the practical minimum.)
    • Reads from non-genomic samples such as cDNA, RNA, and PCR amplicons.
    • Non-random strategies such as CoT and exon enrichment.
    • Reads that include sequencing adapters or multiplex tags.
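
The read-length limits above can be checked before an assembly is attempted. The following is a minimal sketch, not part of Celera Assembler: the function name and the category labels are made up for illustration, and it assumes a plain four-lines-per-record FASTQ stream with no wrapped sequence lines.

```python
import io

MIN_READ_BP = 64    # reads shorter than this are not supported (from the list above)
GOOD_READ_BP = 100  # suggested minimum for a good assembly (from the list above)


def classify_read_lengths(fastq_handle):
    """Count reads per length class in a 4-line-per-record FASTQ stream.

    Hypothetical helper for illustration; assumes unwrapped sequence lines.
    """
    counts = {"too_short": 0, "marginal": 0, "good": 0}
    for line_number, line in enumerate(fastq_handle):
        if line_number % 4 != 1:  # the sequence is the 2nd line of each record
            continue
        length = len(line.strip())
        if length < MIN_READ_BP:
            counts["too_short"] += 1
        elif length < GOOD_READ_BP:
            counts["marginal"] += 1
        else:
            counts["good"] += 1
    return counts


# Tiny in-memory example with reads of 50, 80, and 120 bases.
example = io.StringIO(
    "@r1\n" + "A" * 50 + "\n+\n" + "I" * 50 + "\n"
    "@r2\n" + "C" * 80 + "\n+\n" + "I" * 80 + "\n"
    "@r3\n" + "G" * 120 + "\n+\n" + "I" * 120 + "\n"
)
print(classify_read_lengths(example))
```

A data set dominated by "too_short" reads falls outside what the assembler supports; one dominated by "marginal" reads may run but is unlikely to assemble well.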

Input Requirements

  • Sanger data is expected in FRG format, a record-oriented text file format specific to Celera Assembler. See the FRG Files guide.
  • For 454 data, SFF files are expected. Celera Assembler includes a pre-processor, sffToCA, that converts SFF to FRG. 454 support started in CA 5.0.
  • For Solexa data, either of two equivalent formats is accepted: FASTQ or a one-read-per-line text format. Solexa support started in CA 6.0.
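
As a rough illustration of telling these inputs apart, the sketch below inspects the first bytes of a file. It is an assumption-laden helper, not a Celera Assembler tool: the function name is hypothetical, it relies on SFF files beginning with the four magic bytes ".sff" and on FASTQ records beginning with "@", and it makes no attempt to recognize FRG files or the one-read-per-line format.

```python
import os
import tempfile

SFF_MAGIC = b".sff"  # assumed magic number at the start of an SFF file


def sniff_input_format(path):
    """Return 'sff', 'fastq', or 'unknown' from the first bytes of a file.

    Hypothetical helper; a heuristic only, not a format validator.
    """
    with open(path, "rb") as handle:
        head = handle.read(4)
    if head == SFF_MAGIC:
        return "sff"
    if head.startswith(b"@"):
        return "fastq"
    return "unknown"


# Small demonstration using throwaway files.
with tempfile.TemporaryDirectory() as workdir:
    fastq_path = os.path.join(workdir, "reads.fastq")
    with open(fastq_path, "wb") as out:
        out.write(b"@read1\nACGT\n+\nIIII\n")
    print(sniff_input_format(fastq_path))
```

Files reported as "sff" would go through sffToCA first, as described above; anything reported as "unknown" needs to be identified by hand.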

Operating System Requirements

  • A modern version of Unix.
    • Linux is OK (usually tested on Red Hat, CentOS, and SUSE).
    • FreeBSD is OK.
    • Mac OS X is OK; use the 'Terminal' program for the Unix command line.
    • Solaris is untested. Some users have managed it; others have had problems.
  • Binaries are supplied for Linux. Users of other systems, and users of the latest source code, will need to build binaries from source. Building requires 'gcc', 'make', and 'perl'.
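
Before building from source, it can help to confirm that the three tools named above are on the PATH. This is a small sketch using only the Python standard library; the function name is made up for illustration, and it checks only that each program can be found, not that its version is adequate.

```python
import shutil

REQUIRED_TOOLS = ("gcc", "make", "perl")  # build tools named in the text


def missing_build_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on the current PATH.

    The `which` parameter exists only so the lookup can be replaced in tests.
    """
    return [tool for tool in tools if which(tool) is None]


missing = missing_build_tools()
if missing:
    print("Missing build tools: " + ", ".join(missing))
else:
    print("All build tools found.")
```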

Hardware Requirements

  • Intel-compatible CPU. AMD Opteron works too.
  • 64-bit processor. (It is possible to build the binaries on 32-bit processors, but the resulting programs will be limited to small genomes such as bacteria.)
  • Lots of RAM. Perhaps 2 GB for a bacterial genome, up to 32 GB for eukaryotes at 10X coverage, and something like 100 GB for data sets of 200 million reads.
  • Lots of disk. Perhaps 1 TB or more for a mammalian genome. The disk need not be fast, as CA is usually CPU-bound.
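
A machine can be compared against these figures with a short script. The sketch below is illustrative only and all function names are hypothetical. It treats the pointer size of the running Python interpreter as a proxy for a 64-bit system (a 32-bit Python on a 64-bit machine would report 32-bit), and it reads total RAM through `os.sysconf`, which is not available on every platform, so that function may return None.

```python
import os
import shutil
import struct


def is_64_bit_python():
    """True when the running interpreter uses 64-bit pointers."""
    return struct.calcsize("P") * 8 == 64


def total_ram_gb():
    """Total physical RAM in GB, or None where sysconf cannot report it."""
    try:
        pages = os.sysconf("SC_PHYS_PAGES")
        page_size = os.sysconf("SC_PAGE_SIZE")
    except (AttributeError, ValueError, OSError):
        return None
    if pages <= 0 or page_size <= 0:
        return None
    return pages * page_size / 1e9


def free_disk_tb(path="."):
    """Free space, in TB, on the file system that holds `path`."""
    return shutil.disk_usage(path).free / 1e12


print("64-bit interpreter:", is_64_bit_python())
print("Total RAM (GB):", total_ram_gb())
print("Free disk (TB):", free_disk_tb("."))
```

The printed values can then be read against the rough guidance above, for example 2 GB of RAM for a bacterial genome or 1 TB of disk for a mammalian one.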