TracearchiveToCA

(formerly known as tracedb-to-frg.pl)

Takes as input a list of NCBI Trace Archive formatted fasta, qual and xml files. Writes Celera Assembler FRG format output. Designed for converting LARGE amounts of data, using SGE.

It operates in three steps; the first and last can be run in parallel.

  1. Parse the xml.prefix.### files. Parallel.
  2. Scan the parsed xml to find libraries and mates. Writes CA LIB and LKG messages. Sequential.
  3. Scan the fasta/qual files, write FRG messages. Parallel.

If you have SGE installed, you can run all three stages, parallel if appropriate, with:

tracearchiveToCA -sgeoptions '-P project -A account etc' -sge xml*

Otherwise, you'll need to run (sequentially) with:

tracearchiveToCA -xml xml  (once per xml file)
tracearchiveToCA -lib xml* (ALL xml files)
tracearchiveToCA -frg xml  (once per xml file)
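
The sequential recipe above can be wrapped in a small shell function. This is only a sketch: the function name run_all_stages and its first argument (the command used to start tracearchiveToCA on your system) are illustrative, and it assumes the script signals failure through a non-zero exit status.

```shell
# Run the three stages one after another, without SGE.
# Usage: run_all_stages TOOL XMLFILE...
# TOOL is whatever command runs tracearchiveToCA (hypothetical argument).
run_all_stages() {
    tool=$1
    shift
    # Stage 1: parse each XML file on its own.
    for xml in "$@"; do
        "$tool" -xml "$xml" || return 1
    done
    # Stage 2: a single invocation that is given ALL the XML files.
    "$tool" -lib "$@" || return 1
    # Stage 3: write FRG messages, once per XML file.
    for xml in "$@"; do
        "$tool" -frg "$xml" || return 1
    done
}
```

For example, run_all_stages tracearchiveToCA xml* would run every stage in the current directory.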

Output goes in the CURRENT DIRECTORY.

Script diagnostic output from steps 1 and 3 should be empty; output from step 2 will contain the mate rates.

The output is:
* prefix.1.lib.frg -- all library info
* prefix.2.###.frg -- fragment data, one per input fasta/qual pair
* prefix.3.lkg.frg -- mate info

Other output:
* *.lib -- a map from TA frag id to TA library id, along with some info about the library
* *.frglib -- a map from TA frag id to CA library UID

In more detail, the three phases of operation are:

# tracearchiveToCA -xml <xml_organism_001> : This reads one XML file and writes one temp file like tafrg-organism/organism.001.lib. Run this once for each XML file you have. You can run these in parallel. This runs quickly.
# tracearchiveToCA -lib <xml*> : This reads all the XML files. It writes temp files like tafrg-organism/organism.???.frglib. It writes the permanent file organism.1.lib.frg, the prefix to your final FRG file. It also writes the permanent file organism.3.lkg.frg, the suffix to your final FRG file. Make sure you run this on all the XML files at once; the script needs to find all reads for each library and libraries may be spread across XML files. This runs for an intermediate amount of time.
# tracearchiveToCA -frg <xml_organism_001> : This reads one XML file and writes one permanent file like organism.2.001.frg. Run this once for each XML file you have. You can run these in parallel. The collection of outputs forms the middle section of your final FRG file. This runs for a long time.
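
To see which files each phase should leave behind for a given input, the names can be derived from the XML file name. The sketch below assumes inputs are named like xml.organism.001.bz2 (the pattern used in the example on this page); the helper names ta_prefix, ta_piece and expected_outputs are made up for illustration and are not part of tracearchiveToCA.

```shell
# Organism prefix of an input such as xml.organism.001.bz2 -> organism
ta_prefix() {
    name=${1##*/}          # drop any leading directory
    name=${name#xml.}      # drop the leading "xml."
    name=${name%.bz2}      # drop a trailing .bz2, if present
    echo "${name%.*}"      # keep everything before the last dot
}

# Piece number of an input such as xml.organism.001.bz2 -> 001
ta_piece() {
    name=${1##*/}
    name=${name%.bz2}
    echo "${name##*.}"     # keep everything after the last dot
}

# Files the three phases are expected to write for one XML input.
expected_outputs() {
    p=$(ta_prefix "$1")
    n=$(ta_piece "$1")
    echo "tafrg-$p/$p.$n.lib"       # phase 1 (temporary)
    echo "tafrg-$p/$p.$n.frglib"    # phase 2 (temporary)
    echo "$p.2.$n.frg"              # phase 3 (permanent)
}
```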

The result is a set of FRG files. For convenience, these can be merged into one file ('cat organism.?.???.frg > organism.frg'). If they are not merged, they must be supplied to runCA in order: first *.1.lib.frg (the library information), then the *.2.???.frg files (the fragment data), then *.3.lkg.frg (the mate pair associations).
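
If you prefer not to rely on the wildcard expanding in the right order, the required ordering can be spelled out explicitly. A minimal sketch, with frg_in_order being a hypothetical helper name; it assumes uncompressed outputs with three-digit piece numbers, as in the example on this page.

```shell
# Print the FRG files for PREFIX in the order runCA needs them:
# library info first, then the fragment pieces, then the mate links.
frg_in_order() {
    prefix=$1
    [ -f "$prefix.1.lib.frg" ] || { echo "missing $prefix.1.lib.frg" >&2; return 1; }
    [ -f "$prefix.3.lkg.frg" ] || { echo "missing $prefix.3.lkg.frg" >&2; return 1; }
    echo "$prefix.1.lib.frg"
    # Glob expansion is sorted, so the pieces come out as 001, 002, ...
    for f in "$prefix".2.[0-9][0-9][0-9].frg; do
        [ -f "$f" ] && echo "$f"
    done
    echo "$prefix.3.lkg.frg"
}
```

The list can then be used either way: cat $(frg_in_order organism) > organism.frg to merge, or pass the printed names to runCA in that order.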

== Example ==

I have downloaded all of the fasta, qual and xml files for Anopheles gambiae S from ftp://ftp.ncbi.nih.gov/pub/TraceDB/anopheles_gambiae_s.

<pre>
% ls -l
total 2086780K
-rw-rw-r-- 1 bwalenz atg 144974165 Feb 26  2010 fasta.anopheles_gambiae_s.001.bz2
-rw-rw-r-- 1 bwalenz atg 142363781 Feb 26  2010 fasta.anopheles_gambiae_s.002.bz2
-rw-rw-r-- 1 bwalenz atg 141866818 Feb 26  2010 fasta.anopheles_gambiae_s.003.bz2
-rw-rw-r-- 1 bwalenz atg 141635494 Feb 26  2010 fasta.anopheles_gambiae_s.004.bz2
-rw-rw-r-- 1 bwalenz atg 139100074 Feb 26  2010 fasta.anopheles_gambiae_s.005.bz2
-rw-rw-r-- 1 bwalenz atg  58630914 Feb 26  2010 fasta.anopheles_gambiae_s.006.bz2
-rw-rw-r-- 1 bwalenz atg 246313887 Feb 26  2010 qual.anopheles_gambiae_s.001.bz2
-rw-rw-r-- 1 bwalenz atg 244711323 Feb 26  2010 qual.anopheles_gambiae_s.002.bz2
-rw-rw-r-- 1 bwalenz atg 239172489 Feb 26  2010 qual.anopheles_gambiae_s.003.bz2
-rw-rw-r-- 1 bwalenz atg 238825449 Feb 26  2010 qual.anopheles_gambiae_s.004.bz2
-rw-rw-r-- 1 bwalenz atg 238380750 Feb 26  2010 qual.anopheles_gambiae_s.005.bz2
-rw-rw-r-- 1 bwalenz atg 109804426 Feb 26  2010 qual.anopheles_gambiae_s.006.bz2
-rw-rw-r-- 1 bwalenz atg   8636461 Feb 26  2010 xml.anopheles_gambiae_s.001.bz2
-rw-rw-r-- 1 bwalenz atg   8642117 Feb 26  2010 xml.anopheles_gambiae_s.002.bz2
-rw-rw-r-- 1 bwalenz atg   8527419 Feb 26  2010 xml.anopheles_gambiae_s.003.bz2
-rw-rw-r-- 1 bwalenz atg   8521969 Feb 26  2010 xml.anopheles_gambiae_s.004.bz2
-rw-rw-r-- 1 bwalenz atg   8554886 Feb 26  2010 xml.anopheles_gambiae_s.005.bz2
-rw-rw-r-- 1 bwalenz atg   3899287 Feb 26  2010 xml.anopheles_gambiae_s.006.bz2
</pre>

The conversion to FRG format proceeds in three stages.

=== Reading the XML ===

Each XML file is processed to build a list of the libraries included, the fragments, and their mate relationships.

% perl $bri/wgs/Linux-amd64/bin/tracearchiveToCA -xml xml.anopheles_gambiae_s.001.bz2
% perl $bri/wgs/Linux-amd64/bin/tracearchiveToCA -xml xml.anopheles_gambiae_s.002.bz2
% perl $bri/wgs/Linux-amd64/bin/tracearchiveToCA -xml xml.anopheles_gambiae_s.003.bz2
% perl $bri/wgs/Linux-amd64/bin/tracearchiveToCA -xml xml.anopheles_gambiae_s.004.bz2
% perl $bri/wgs/Linux-amd64/bin/tracearchiveToCA -xml xml.anopheles_gambiae_s.005.bz2
% perl $bri/wgs/Linux-amd64/bin/tracearchiveToCA -xml xml.anopheles_gambiae_s.006.bz2

The outputs are stored in the tafrg-anopheles_gambiae_s directory, which is created in the directory tracearchiveToCA is run from. Each file lists the fragments found, with their template id and orientation, the library and its insert size estimate, and clear ranges.

% ls -l tafrg-anopheles_gambiae_s
total 364878K
-rw-rw-r-- 1 bwalenz atg 58277755 Sep  9 13:13 anopheles_gambiae_s.001.lib
-rw-rw-r-- 1 bwalenz atg 58212274 Sep  9 13:20 anopheles_gambiae_s.002.lib
-rw-rw-r-- 1 bwalenz atg 57036222 Sep  9 13:20 anopheles_gambiae_s.003.lib
-rw-rw-r-- 1 bwalenz atg 56330834 Sep  9 13:20 anopheles_gambiae_s.004.lib
-rw-rw-r-- 1 bwalenz atg 56411508 Sep  9 13:20 anopheles_gambiae_s.005.lib
-rw-rw-r-- 1 bwalenz atg 24785620 Sep  9 13:19 anopheles_gambiae_s.006.lib
% more tafrg-anopheles_gambiae_s/anopheles_gambiae_s.001.lib 
1429179647      1061029030224   REVERSE ANOPHELES-GAMBIAE-PIMPERENA-S_AGAMBIAE-G-02-10-12KB     11000   2200    17,944  0,1009  17,944
1429179648      1061029030226   REVERSE ANOPHELES-GAMBIAE-PIMPERENA-S_AGAMBIAE-G-02-10-12KB     11000   2200    21,960  0,1006  21,960
1429179649      1061029030228   REVERSE ANOPHELES-GAMBIAE-PIMPERENA-S_AGAMBIAE-G-02-10-12KB     11000   2200    22,964  0,1016  22,964
1429179650      1061029030230   REVERSE ANOPHELES-GAMBIAE-PIMPERENA-S_AGAMBIAE-G-02-10-12KB     11000   2200    75,776  75,1011 19,776
1429179651      1061029030232   REVERSE ANOPHELES-GAMBIAE-PIMPERENA-S_AGAMBIAE-G-02-10-12KB     11000   2200    112,948 112,1015        13,948
1429179652      1061029030234   REVERSE ANOPHELES-GAMBIAE-PIMPERENA-S_AGAMBIAE-G-02-10-12KB     11000   2200    68,807  0,1013  68,807
1429179653      1061029030236   REVERSE ANOPHELES-GAMBIAE-PIMPERENA-S_AGAMBIAE-G-02-10-12KB     11000   2200    11,808  0,999   11,808
1429179654      1061029030238   REVERSE ANOPHELES-GAMBIAE-PIMPERENA-S_AGAMBIAE-G-02-10-12KB     11000   2200    20,934  0,1025  20,934
1429179655      1061029030240   REVERSE ANOPHELES-GAMBIAE-PIMPERENA-S_AGAMBIAE-G-02-10-12KB     11000   2200    25,949  0,1018  25,949
1429179656      1061029030242   REVERSE ANOPHELES-GAMBIAE-PIMPERENA-S_AGAMBIAE-G-02-10-12KB     11000   2200    17,909  0,1002  17,909
.
.
.
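
A quick way to check which libraries were found is to tally the *.lib files. The sketch below assumes the column layout seen in the listing above (column 4 is the library name, columns 5 and 6 are the insert size estimate, taken here to be a mean and a standard deviation); lib_summary is an illustrative name, not part of the tool.

```shell
# Count fragments per library in one or more *.lib files.
# Output columns: library, insert size (assumed mean), (assumed std. dev.), fragment count.
lib_summary() {
    awk '{ n[$4 "\t" $5 "\t" $6]++ }
         END { for (k in n) print k "\t" n[k] }' "$@" | sort
}
```

For example: lib_summary tafrg-anopheles_gambiae_s/*.lib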

=== Writing the LIB and LKG messages ===

The intermediate output in tafrg-anopheles_gambiae_s is now scanned to pair mates. Because both mates of a pair are not guaranteed to be in the same XML file, the entire batch must be scanned at once. While this runs, the number of fragments and mate pairs found, and the fraction mated, are reported to the screen. The later files can report more mates than they contain (a fraction above 100%), because some of the mated fragments are in earlier files.

% perl $bri/wgs/Linux-amd64/bin/tracearchiveToCA -lib xml.anopheles_gambiae_s.*.bz2
anopheles_gambiae_s.001: frags=500000 links=249371 (99.74%)
anopheles_gambiae_s.002: frags=500001 links=249337 (99.73%)
anopheles_gambiae_s.003: frags=500000 links=242547 (97.01%)
anopheles_gambiae_s.004: frags=500000 links=243290 (97.31%)
anopheles_gambiae_s.005: frags=500000 links=253230 (101.29%)
anopheles_gambiae_s.006: frags=214216 links=115969 (108.27%)
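
Since per-file percentages can exceed 100%, the overall figure is more informative. The sketch below sums the per-file lines and computes the mated fraction as 2 * links / frags, which appears consistent with the percentages printed above but is an assumption about how the script computes them; mated_fraction is a hypothetical helper.

```shell
# Sum the "frags=N links=M" lines printed by the -lib stage and report the
# overall mated fraction, assuming each link joins two fragments.
# Reads the named files, or standard input if none are given.
mated_fraction() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^frags=/) { sub(/^frags=/, "", $i); frags += $i }
            if ($i ~ /^links=/) { sub(/^links=/, "", $i); links += $i }
        }
    }
    END {
        if (frags > 0)
            printf "frags=%d links=%d mated=%.2f%%\n", frags, links, 200 * links / frags
    }' "$@"
}
```

Fed the six lines shown above, this totals 2714217 fragments and 1353744 links, an overall mated fraction of 99.75%.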

The output of this stage is also stored in tafrg-anopheles_gambiae_s, in the *.frglib files. These map each fragment id to its sequencing library.

% ls -l tafrg-anopheles_gambiae_s/
total 561908K
-rw-rw-r-- 1 bwalenz atg 31500000 Sep  9 13:46 anopheles_gambiae_s.001.frglib
-rw-rw-r-- 1 bwalenz atg 58277755 Sep  9 13:13 anopheles_gambiae_s.001.lib
-rw-rw-r-- 1 bwalenz atg 31500063 Sep  9 13:46 anopheles_gambiae_s.002.frglib
-rw-rw-r-- 1 bwalenz atg 58212274 Sep  9 13:20 anopheles_gambiae_s.002.lib
-rw-rw-r-- 1 bwalenz atg 30879984 Sep  9 13:47 anopheles_gambiae_s.003.frglib
-rw-rw-r-- 1 bwalenz atg 57036222 Sep  9 13:20 anopheles_gambiae_s.003.lib
-rw-rw-r-- 1 bwalenz atg 30522472 Sep  9 13:47 anopheles_gambiae_s.004.frglib
-rw-rw-r-- 1 bwalenz atg 56330834 Sep  9 13:20 anopheles_gambiae_s.004.lib
-rw-rw-r-- 1 bwalenz atg 30148781 Sep  9 13:47 anopheles_gambiae_s.005.frglib
-rw-rw-r-- 1 bwalenz atg 56411508 Sep  9 13:20 anopheles_gambiae_s.005.lib
-rw-rw-r-- 1 bwalenz atg 13218522 Sep  9 13:47 anopheles_gambiae_s.006.frglib
-rw-rw-r-- 1 bwalenz atg 24785620 Sep  9 13:19 anopheles_gambiae_s.006.lib

% more tafrg-anopheles_gambiae_s/anopheles_gambiae_s.001.frglib
1429179647      ANOPHELES-GAMBIAE-PIMPERENA-S_AGAMBIAE-G-02-10-12KB
1429179648      ANOPHELES-GAMBIAE-PIMPERENA-S_AGAMBIAE-G-02-10-12KB
1429179649      ANOPHELES-GAMBIAE-PIMPERENA-S_AGAMBIAE-G-02-10-12KB
1429179650      ANOPHELES-GAMBIAE-PIMPERENA-S_AGAMBIAE-G-02-10-12KB
1429179651      ANOPHELES-GAMBIAE-PIMPERENA-S_AGAMBIAE-G-02-10-12KB
1429179652      ANOPHELES-GAMBIAE-PIMPERENA-S_AGAMBIAE-G-02-10-12KB
1429179653      ANOPHELES-GAMBIAE-PIMPERENA-S_AGAMBIAE-G-02-10-12KB
1429179654      ANOPHELES-GAMBIAE-PIMPERENA-S_AGAMBIAE-G-02-10-12KB
1429179655      ANOPHELES-GAMBIAE-PIMPERENA-S_AGAMBIAE-G-02-10-12KB
1429179656      ANOPHELES-GAMBIAE-PIMPERENA-S_AGAMBIAE-G-02-10-12KB
.
.
.

After this stage, we have two fragment files, one containing the LIB messages and the other containing the LKG messages.

% ls -l *.frg
-rw-rw-r-- 1 bwalenz atg      575 Sep  9 13:47 anopheles_gambiae_s.1.lib.frg
-rw-rw-r-- 1 bwalenz atg 58211005 Sep  9 13:47 anopheles_gambiae_s.3.lkg.frg

=== Writing the FRG messages ===

The last stage, the most expensive, rewrites the fragments into FRG messages. Like the first stage, this must be run once per input file. Each will report the number of fragments found.

% perl $bri/wgs/Linux-amd64/bin/tracearchiveToCA -frg xml.anopheles_gambiae_s.001.bz2&
Found 107108 in tafrg-anopheles_gambiae_s/anopheles_gambiae_s.006.frglib.
% perl $bri/wgs/Linux-amd64/bin/tracearchiveToCA -frg xml.anopheles_gambiae_s.002.bz2&
Found 250000 in tafrg-anopheles_gambiae_s/anopheles_gambiae_s.004.frglib.
% perl $bri/wgs/Linux-amd64/bin/tracearchiveToCA -frg xml.anopheles_gambiae_s.003.bz2&
Found 250001 in tafrg-anopheles_gambiae_s/anopheles_gambiae_s.002.frglib.
% perl $bri/wgs/Linux-amd64/bin/tracearchiveToCA -frg xml.anopheles_gambiae_s.004.bz2&
Found 250000 in tafrg-anopheles_gambiae_s/anopheles_gambiae_s.001.frglib.
% perl $bri/wgs/Linux-amd64/bin/tracearchiveToCA -frg xml.anopheles_gambiae_s.005.bz2&
Found 250000 in tafrg-anopheles_gambiae_s/anopheles_gambiae_s.005.frglib.
% perl $bri/wgs/Linux-amd64/bin/tracearchiveToCA -frg xml.anopheles_gambiae_s.006.bz2&
Found 250000 in tafrg-anopheles_gambiae_s/anopheles_gambiae_s.003.frglib.
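
When the jobs are launched in the background like this, it helps to wait for all of them before using the outputs. A minimal sketch, assuming the script exits non-zero on failure; run_frg_parallel and its TOOL argument are illustrative names.

```shell
# Launch the -frg stage for every XML file in the background, then wait for
# each job and return non-zero if any of them failed.
# TOOL is whatever command runs tracearchiveToCA on your system.
run_frg_parallel() {
    tool=$1
    shift
    pids=""
    for xml in "$@"; do
        "$tool" -frg "$xml" &
        pids="$pids $!"
    done
    status=0
    for pid in $pids; do
        wait "$pid" || status=1
    done
    return $status
}
```

This starts every job at once, which suits a handful of files on a multi-core machine; for large batches SGE is the intended route.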

This leaves us with numerous fragment files, one for each piece. Note that some versions of tracearchiveToCA will compress the fragment files that contain FRG messages (the ones with three digits at the end of the name).

% ls -l *frg
-rw-rw-r-- 1 bwalenz atg        575 Sep  9 13:47 anopheles_gambiae_s.1.lib.frg
-rw-rw-r-- 1 bwalenz atg 1083204902 Sep  9 14:19 anopheles_gambiae_s.2.001.frg
-rw-rw-r-- 1 bwalenz atg 1065406126 Sep  9 14:19 anopheles_gambiae_s.2.002.frg
-rw-rw-r-- 1 bwalenz atg 1064741260 Sep  9 14:18 anopheles_gambiae_s.2.003.frg
-rw-rw-r-- 1 bwalenz atg 1065800259 Sep  9 14:19 anopheles_gambiae_s.2.004.frg
-rw-rw-r-- 1 bwalenz atg 1038337676 Sep  9 14:17 anopheles_gambiae_s.2.005.frg
-rw-rw-r-- 1 bwalenz atg  472021190 Sep  9 14:07 anopheles_gambiae_s.2.006.frg
-rw-rw-r-- 1 bwalenz atg   58211005 Sep  9 13:47 anopheles_gambiae_s.3.lkg.frg

At this time the intermediate directory tafrg-anopheles_gambiae_s can be removed.
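
Before deleting the intermediate directory, it may be worth confirming that every piece produced its FRG output. The following is a cautious sketch rather than anything the tool provides: remove_intermediate is a hypothetical helper, and it assumes compressed outputs keep the .frg name with an extra suffix.

```shell
# Remove tafrg-PREFIX only if the LIB and LKG files exist and every *.lib piece
# in it has a matching PREFIX.2.NNN.frg (optionally with a compression suffix).
remove_intermediate() {
    prefix=$1
    dir="tafrg-$prefix"
    [ -d "$dir" ] || return 0
    [ -f "$prefix.1.lib.frg" ] && [ -f "$prefix.3.lkg.frg" ] || return 1
    for lib in "$dir/$prefix".[0-9][0-9][0-9].lib; do
        [ -f "$lib" ] || continue
        piece=${lib%.lib}        # strip the .lib suffix
        piece=${piece##*.}       # keep the three-digit piece number
        found=no
        for frg in "$prefix.2.$piece".frg*; do
            [ -f "$frg" ] && found=yes
        done
        if [ "$found" = no ]; then
            echo "piece $piece has no FRG output; keeping $dir" >&2
            return 1
        fi
    done
    rm -r "$dir"
}
```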

== Error Messages ==

Nothing to do.  'organism.1.lib.frg' and 'organism.3.lkg.frg' exist.
(If later steps fail, remove these two files to recompute intermediate results)

This happens during phase 2 launched with 'tracearchiveToCA -lib xml*'. The program wants to build one file of LIB messages and one file of LKG messages to represent the whole data set. If those files already exist, then something has gone wrong. Delete those files and start again. Make sure you provide all the XML files at once (using the wildcard xml*) on the command line.