RunCA Dissection

From wgs-assembler

!!! WARNING !!!

This page describes runCA version 6.1. It has not been updated to runCA version 8.

!!! WARNING !!!

The Celera Assembler consists of numerous stand-alone programs, coordinated by an 'executive script' called runCA. runCA reports the command line of each program it runs, giving both a progress indication and a description of what is being done. This document annotates the runCA output, describing each stage and the various files used or generated by each.

Fragments and the runCA Command

We'll use a half-plate of 454 FLX 3Kbp mate pair fragments from Porphyromonas gingivalis W83. The SFF file is already converted to FRG format, resulting in 553,724 fragments.

runCA is configured to run with extremely small batch sizes, to show that some of the steps run multiple instances of a single command.

I have a default .runCA specFile in my home directory, which contains:

fakeUIDs      = 1

#  SGE configuration
#
useGrid       = 0
scriptOnGrid  = 0

frgCorrOnGrid = 1
ovlCorrOnGrid = 1

sge                   = -A assembly
sgeScript             = -pe thread 2 -l memory=4g -p  400
sgeOverlap            = -pe thread 2 -l memory=2g -p -600
sgeMerOverlapSeed     = -pe thread 2 -l memory=2g -p -600
sgeMerOverlapExtend   = -pe thread 2 -l memory=2g -p -500
sgeConsensus          = -pe thread 1 -l memory=2g -p -600
sgeFragmentCorrection = -pe thread 2 -l memory=3g -p -500
sgeOverlapCorrection  = -pe thread 1 -l memory=4g -p -400

#  MERYL configuration
#
merylMemory   = -segments 32 -threads 4
merylThreads  = 4

#  TRIMMING configuration
#
#doOBT    = 1
#doMBT    = 0

#  OVERLAPPER configuration
#
ovlMemory        = 4GB --hashload 0.8 --hashstrings 100000
ovlThreads       = 2
ovlHashBlockSize = 180000
ovlRefBlockSize  = 2000000

#  MER OVERLAPPER configuration
#
merOverlapperThreads         = 2
merOverlapperSeedBatchSize   = 90000
merOverlapperExtendBatchSize = 90000

#  ERROR CORRECTION configuration
#
#  Dros used 500,000, using 7-8GB with Sanger reads
frgCorrBatchSize = 200000
frgCorrThreads   = 2

#  800000 = 10GB on sanger
ovlCorrBatchSize = 200000

#  UNITIGGER configuration
#

#  SCAFFOLDER configuration
#
doExtendClearRanges = 2

The assembly is launched with the following runCA command, which overrides some of the parameters set in the specFile. The small ovlRefBlockSize and ovlHashBlockSize values force many overlap commands to run, and cnsPartitions/cnsMinFrags force many consensus commands:

runCA \
  -p dissect \
  -d dissect \
  ovlRefBlockSize=50000 \
  ovlHashBlockSize=50000 \
  cnsPartitions=16 \
  cnsMinFrags=50000 \
  porphyromonas_gingivalis_w83.flx.3200bp.0900bp.E8YURXS0.frg

Some of the stopAfter break points are also annotated below.

Load fragment data

The very first thing runCA does is create a new directory for the assembly. We call this the root directory of the assembly.

gatekeeper

bin/gatekeeper \
  -o dissect.gkpStore.BUILDING \
  -T \
  -F \
  -E dissect.gkpStore.errorLog \
  porphyromonas_gingivalis_w83.flx.3200bp.0900bp.E8YURXS01.frg \
  > dissect.gkpStore.err 2>&1

The gatekeeper command loads fragment data from the input FRG files into a gatekeeper store, also called the gkpStore. It performs a few checks on the inputs, reporting any errors or warnings to the dissect.gkpStore.errorLog file in the root assembly directory. The dissect.gkpStore.err file contains a summary of the errors.

We strongly suggest you examine at least the dissect.gkpStore.err summary file soon after starting an assembly.

dissect.gkpStore.err

Starting file 'porphyromonas_gingivalis_w83.flx.3200bp.0900bp.E8YURXS0.frg' at line 0.
GKP finished with 172554 errors:
61029   # FRG Error: sequence length < min allowed sequence length.
61029   # FRG Alert: loaded, but marked as deleted due to errors previously reported.
50496   # LKG Error: Fragment is marked as deleted.

dissect.gkpStore.errorLog

# FRG Error: Fragment E8YURXS01A8ZIW sequence length 62 < 64 min allowed sequence length.
# FRG Alert: Fragment E8YURXS01A8ZIW loaded, but marked as deleted due to errors previously reported.
# FRG Error: Fragment E8YURXS01ENL4W sequence length 62 < 64 min allowed sequence length.
# FRG Alert: Fragment E8YURXS01ENL4W loaded, but marked as deleted due to errors previously reported.
# FRG Error: Fragment E8YURXS01BTO9T sequence length 52 < 64 min allowed sequence length.
# FRG Alert: Fragment E8YURXS01BTO9T loaded, but marked as deleted due to errors previously reported.
# FRG Error: Fragment E8YURXS01ATO85 sequence length 59 < 64 min allowed sequence length.
# FRG Alert: Fragment E8YURXS01ATO85 loaded, but marked as deleted due to errors previously reported.
# FRG Error: Fragment E8YURXS01CVI4G sequence length 57 < 64 min allowed sequence length.
# FRG Alert: Fragment E8YURXS01CVI4G loaded, but marked as deleted due to errors previously reported.
.
.
.
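The errorLog can run to hundreds of thousands of lines. A quick way to summarize it is to tally lines by their message class; this is an illustrative sketch (not part of the assembler), assuming only the '# <class>: <detail>' layout shown above:

```python
from collections import Counter

def tally_gkp_errors(lines):
    """Tally errorLog lines by message class ('FRG Error', 'FRG Alert', ...).

    Assumes the '# <class>: <detail>' layout shown above; any other
    line is ignored.
    """
    counts = Counter()
    for line in lines:
        if line.startswith("# ") and ":" in line:
            counts[line[2:].split(":", 1)[0].strip()] += 1
    return counts

log = [
    "# FRG Error: Fragment E8YURXS01A8ZIW sequence length 62 < 64 min allowed sequence length.",
    "# FRG Alert: Fragment E8YURXS01A8ZIW loaded, but marked as deleted due to errors previously reported.",
    "# FRG Error: Fragment E8YURXS01ENL4W sequence length 62 < 64 min allowed sequence length.",
]
print(tally_gkp_errors(log))  # Counter({'FRG Error': 2, 'FRG Alert': 1})
```

Run over the full errorLog this reproduces the per-type counts that gatekeeper prints in dissect.gkpStore.err.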
stopAfter = initialStoreBuilding

OBT, Overlap Based Trimming

The Overlap Based Trimming module is one of the most complicated modules to run. It requires computing overlaps, which in turn requires generating a mer count histogram. Then four different programs compute the trimming.

Compute an initial trimming based on quality values

bin/initialTrim \
 -log 0-overlaptrim/dissect.initialTrimLog \
 -frg dissect.gkpStore \
 >  0-overlaptrim/dissect.initialTrim.report \
 2> 0-overlaptrim/dissect.initialTrim.err 

The first step is to replace any existing clear range on the input fragments with one derived from a windowed average of the quality values. This clear range is designed to be extremely loose, trimming out only the obviously bad sequence: windows with an average quality value below 12 are trimmed. This program logs changes made to fragments in dissect.initialTrimLog and summarizes the changes in dissect.initialTrim.report. Typically, dissect.initialTrim.err is empty.

0-overlaptrim/dissect.initialTrim.report

Fragments with:
 no changes allowed:           0
 no QV trim allowed:           492695
 already deleted               61029
 no vector clear range known:  492695 (trimed to quality clear)
 no HQ non-vector sequence:    0 (deleted)
 HQ vector trimmed:            0 (trimmed to intersection)
     5' end:                   0
     3' end:                   0
 LQ vector trimmed:            0 (trimmed to intersection)
     5' end:                   0
     3' end:                   0
 final clear range too short:  0 (deleted)

Of note in the report are the number of fragments that do not allow changes (from the 'doNotOverlapTrim' library feature), the number that do not allow QV trimming (from the 'doNotQVTrim' library feature), and the number already deleted (via gatekeeper filtering). If a fragment has a vector clear range, the number of fragments with high- or low-quality vector sequence trimmed is reported. Finally, if trimming made a fragment too short, that is reported.

0-overlaptrim/dissect.initialTrimLog

uid,iid                 origL   origR   qltL    qltR    vecL    vecR    finalL  finalR  deleted?
E8YURXS01ESZ1R,1        0       248     0       248     1       0       0       248
E8YURXS01AVULC,2        0       104     0       104     1       0       0       104
E8YURXS01DYACD,3        0       302     0       302     1       0       0       302
E8YURXS01AZQ3U,4        0       150     0       150     1       0       0       150
E8YURXS01CZHP2,5        0       65      0       65      1       0       0       65

The log is tab-delimited (here changed to spaces for easier viewing). It shows the name of the fragment, the three input clear ranges (original, quality, and vector), and the final clear range. If the fragment was deleted, the 'deleted?' column is set.
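As an illustration of that column layout, a minimal parser (a sketch assuming the whitespace-delimited fields shown above; the helper name is ours, not the assembler's):

```python
def parse_initial_trim(line):
    """Split one initialTrimLog line into (uid, iid, final_clear, deleted).

    Column layout assumed from the sample above: name, origL/R, qltL/R,
    vecL/R, finalL/R, then an optional deletion flag.
    """
    f = line.split()
    uid, iid = f[0].split(",")
    final = (int(f[7]), int(f[8]))
    return uid, int(iid), final, len(f) > 9

print(parse_initial_trim("E8YURXS01ESZ1R,1 0 248 0 248 1 0 0 248"))
# ('E8YURXS01ESZ1R', 1, (0, 248), False)
```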

Compute overlaps

Overlaps require a list of frequent mers generated by meryl. This list tells overlapper that overlaps seeded with these mers are likely to be uninformative, and so it can skip them.

Compute mer histograms to filter out uninformative seeds

bin/meryl \
  -B -C -v -m 22 -segments 8 -threads 4 -threads 4 -c 0  -L 2 \
  -s dissect.gkpStore:chain \
  -o 0-mercounts/dissect-C-ms22-cm0 \
  > 0-mercounts/meryl.err 2>&1

Meryl reads each fragment in the gkpStore and generates a histogram of k-mer frequencies. As shown, this command will compute the histogram by breaking the input fragments into 8 segments, and computing 4 of them at a time. The output is stored in two binary files, '0-mercounts/dissect-C-ms22-cm0.mcidx' and '0-mercounts/dissect-C-ms22-cm0.mcdat'.

The meryl command is documented elsewhere. The meryl that comes with the Celera Assembler is extended to read the gatekeeper store directly, using the special input name 'dissect.gkpStore:chain'.

Estimate mer threshold

bin/estimate-mer-threshold \
  -g dissect.gkpStore:chain \
  -m 0-mercounts/dissect-C-ms22-cm0 \
  > 0-mercounts/dissect-C-ms22-cm0.estMerThresh.out \
  2> 0-mercounts/dissect-C-ms22-cm0.estMerThresh.err

The estimate-mer-threshold program examines the k-mer histogram generated by meryl and attempts to compute the depth of coverage, and from there a reasonable cutoff for when a k-mer seed becomes uninformative. The method works reasonably well on well-behaved assemblies. On assemblies with low coverage, high rates of polymorphism, or metagenomes, a human can usually pick a better threshold. Its decision is shown in 'dissect-C-ms22-cm0.estMerThresh.out', and a log of the decision is in the corresponding '.err' file.

dissect-C-ms22-cm0.estMerThresh.err

distinct: 7299536
unique:   4549182
total:    80901786
Xcoverage zero 1 0 4549182
Xcoverage drop 2 4549182 348791
Xcoverage drop 3 348791 95063
Xcoverage drop 4 95063 39064
Xcoverage drop 5 39064 19952
Xcoverage drop 6 19952 10822
Xcoverage drop 7 10822 6670
Xcoverage drop 8 6670 4857
Xcoverage drop 9 4857 3746
Xcoverage drop 10 3746 3179
Xcoverage drop 11 3179 2956
Xcoverage incr 12 2956 3208
Xcoverage incr 13 3208 4035
Xcoverage incr 14 4035 5610
Xcoverage incr 15 5610 7852
Xcoverage incr 16 7852 11149
Xcoverage incr 17 11149 15311
Xcoverage incr 18 15311 20950
Xcoverage incr 19 20950 27679
Xcoverage incr 20 27679 34967
Xcoverage incr 21 34967 44594
Xcoverage incr 22 44594 54705
Xcoverage incr 23 54705 66032
Xcoverage incr 24 66032 76090
Xcoverage incr 25 76090 86160
Xcoverage incr 26 86160 94603
Xcoverage incr 27 94603 102949
Xcoverage incr 28 102949 109427
Xcoverage incr 29 109427 115498
Xcoverage incr 30 115498 117448
Xcoverage done 31 117448 115585
Guessed X coverage is 30
Set maxCount to 77, which will cover 99.01% of distinct mers and 93.49% of all mers.
Reset maxCount to 380, which will cover 99.94% of distinct mers and 98.38% of all mers.

When the decision is flawed, you can try to pick a better k-mer size and threshold by hand.
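One way to pick by hand is to mimic the heuristic visible in the log: skip the error-mer peak at count 1, walk down to the first trough, then take the peak that follows as the X coverage. This is a simplified sketch of that idea, not the assembler's actual code; hist maps a count to the number of distinct mers with that count:

```python
def guess_x_coverage(hist):
    """Guess depth of coverage from a k-mer histogram.

    Walks past the initial decreasing region (error k-mers), then
    returns the depth with the largest count beyond that trough.
    """
    depths = sorted(hist)
    i = 0
    while i + 1 < len(depths) and hist[depths[i + 1]] < hist[depths[i]]:
        i += 1
    return max(depths[i:], key=lambda d: hist[d])

# a few points taken from the estimate-mer-threshold log above
hist = {1: 4549182, 2: 348791, 5: 19952, 11: 2956,
        20: 34967, 30: 117448, 31: 115585}
print(guess_x_coverage(hist))  # 30, matching 'Guessed X coverage is 30'
```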

Output uninformative seeds

bin/meryl \
  -Dt -n 319 -s 0-mercounts/dissect-C-ms22-cm0 \
  > 0-mercounts/dissect.nmers.ovl.fasta \
  2> 0-mercounts/dissect.nmers.ovl.fasta.err 
bin/meryl \
  -Dt -n 319 -s 0-mercounts/dissect-C-ms22-cm0 \
  > 0-mercounts/dissect.nmers.obt.fasta \
  2> 0-mercounts/dissect.nmers.obt.fasta.err 

The 'ovl' overlapper reads the list of uninformative k-mer seeds from a fasta file. Two files are generated at this step, one for 'obt' overlaps and the other for 'ovl' overlaps. This example shows two files generated with the same threshold; if 'obtMerThreshold' and 'ovlMerThreshold' are set differently, these files will differ. These files are multi-fasta, with the k-mer frequency on the defline.

These k-mers are 'canonical': instead of counting mers on both the forward and reverse strands of the sequence, we count the lexicographically smaller of each mer and its reverse complement. If our sequence is ACT (reverse complement AGT), instead of counting four 2-mers (AC, CT and AG, GT) we count two: MIN(AC, GT) = AC and MIN(CT, AG) = AG.
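In code, canonicalization is just a comparison against the reverse complement (a generic sketch, not the assembler's implementation):

```python
COMP = str.maketrans("ACGT", "TGCA")

def canonical(mer):
    """Return the canonical k-mer: the lexicographically smaller of the
    mer and its reverse complement."""
    return min(mer, mer.translate(COMP)[::-1])

# the 2-mers of ACT, counted canonically as in the text above
print([canonical(m) for m in ("AC", "CT")])  # ['AC', 'AG']
```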

dissect.nmers.obt.fasta

>514
AAAAAAGTTGACAGTTGGGCGG
>717
AAAAAAGTTTTCAACAACGAAG
>925
AAAAACAGCATAATATTCCTCC
>666
AAAAACAGGTTGTCCGTCAGCG
>504
AAAAACCCGGAATTCAAACCCT

Note that the assembler has two different overlap algorithms, 'mer' and 'ovl', and two different types of overlaps, 'obt' and 'ovl'. The 'obt' overlaps are used for trimming while the 'ovl' overlaps are used to generate unitigs. Either overlap algorithm can be used to generate the overlaps.

Compute Overlaps

0-overlaptrim-overlap/overlap.sh 1 > 0-overlaptrim-overlap/overlap.0001.out 2>&1
0-overlaptrim-overlap/overlap.sh 2 > 0-overlaptrim-overlap/overlap.0002.out 2>&1
0-overlaptrim-overlap/overlap.sh 3 > 0-overlaptrim-overlap/overlap.0003.out 2>&1
0-overlaptrim-overlap/overlap.sh 4 > 0-overlaptrim-overlap/overlap.0004.out 2>&1
0-overlaptrim-overlap/overlap.sh 5 > 0-overlaptrim-overlap/overlap.0005.out 2>&1
0-overlaptrim-overlap/overlap.sh 6 > 0-overlaptrim-overlap/overlap.0006.out 2>&1
0-overlaptrim-overlap/overlap.sh 7 > 0-overlaptrim-overlap/overlap.0007.out 2>&1
0-overlaptrim-overlap/overlap.sh 8 > 0-overlaptrim-overlap/overlap.0008.out 2>&1
0-overlaptrim-overlap/overlap.sh 9 > 0-overlaptrim-overlap/overlap.0009.out 2>&1
.
.
.

The ovlHashBlockSize and ovlRefBlockSize parameters describe how to decompose the full overlap computation into smaller pieces. Configuring overlapper is difficult and will be described elsewhere (the absence of a link here indicates that we have not yet described how to do it).
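One plausible accounting of the resulting job count, assuming each hash block is compared only against fragments up to the end of its own range (the lower triangle of the all-vs-all matrix), split into ref-block-sized pieces. This is our simplification for illustration, not the assembler's exact scheduler:

```python
import math

def overlap_job_count(num_frags, hash_block, ref_block):
    """Count overlap jobs under the lower-triangle assumption:
    hash block i (covering fragments up to i*hash_block) is compared
    against ref blocks spanning fragments 1 .. i*hash_block."""
    n = math.ceil(num_frags / hash_block)
    return sum(math.ceil(i * hash_block / ref_block) for i in range(1, n + 1))

# 553,724 fragments with ovlHashBlockSize=50000 and ovlRefBlockSize=50000
print(overlap_job_count(553724, 50000, 50000))  # 78
```

With the settings used in this assembly that gives 78 jobs, matching the overlap.0078 job visible in the restart transcript later on this page.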

runCA is computing overlaps to be used for trimming, in the 0-overlaptrim-overlap directory. At the end of the computation, our directory has many, many log files.

% ls 0-overlaptrim-overlap
0000000001              overlap.0017.out        overlap.0034.out        overlap.0051.out        overlap.0068.out
overlap.0001.out        overlap.0018.out        overlap.0035.out        overlap.0052.out        overlap.0069.out
overlap.0002.out        overlap.0019.out        overlap.0036.out        overlap.0053.out        overlap.0070.out
overlap.0003.out        overlap.0020.out        overlap.0037.out        overlap.0054.out        overlap.0071.out
overlap.0004.out        overlap.0021.out        overlap.0038.out        overlap.0055.out        overlap.0072.out
overlap.0005.out        overlap.0022.out        overlap.0039.out        overlap.0056.out        overlap.0073.out
overlap.0006.out        overlap.0023.out        overlap.0040.out        overlap.0057.out        overlap.0074.out
overlap.0007.out        overlap.0024.out        overlap.0041.out        overlap.0058.out        overlap.0075.out
overlap.0008.out        overlap.0025.out        overlap.0042.out        overlap.0059.out        overlap.0076.out
overlap.0009.out        overlap.0026.out        overlap.0043.out        overlap.0060.out        overlap.0077.out
overlap.0010.out        overlap.0027.out        overlap.0044.out        overlap.0061.out        overlap.0078.out
overlap.0011.out        overlap.0028.out        overlap.0045.out        overlap.0062.out        overlap.sh
overlap.0012.out        overlap.0029.out        overlap.0046.out        overlap.0063.out        ovljobs.dat
overlap.0013.out        overlap.0030.out        overlap.0047.out        overlap.0064.out        ovlopts.pl
overlap.0014.out        overlap.0031.out        overlap.0048.out        overlap.0065.out
overlap.0015.out        overlap.0032.out        overlap.0049.out        overlap.0066.out
overlap.0016.out        overlap.0033.out        overlap.0050.out        overlap.0067.out

% ls 0-overlaptrim-overlap/0000000001
h0000000001r0000000000.ovb.gz   h0000250001r0000250000.ovb.gz   h0000400001r0000200000.ovb.gz   h0000500001r0000250000.ovb.gz
h0000050001r0000000000.ovb.gz   h0000300001r0000000000.ovb.gz   h0000400001r0000250000.ovb.gz   h0000500001r0000300000.ovb.gz
h0000050001r0000050000.ovb.gz   h0000300001r0000050000.ovb.gz   h0000400001r0000300000.ovb.gz   h0000500001r0000350000.ovb.gz
h0000100001r0000000000.ovb.gz   h0000300001r0000100000.ovb.gz   h0000400001r0000350000.ovb.gz   h0000500001r0000400000.ovb.gz
h0000100001r0000050000.ovb.gz   h0000300001r0000150000.ovb.gz   h0000400001r0000400000.ovb.gz   h0000500001r0000450000.ovb.gz
h0000100001r0000100000.ovb.gz   h0000300001r0000200000.ovb.gz   h0000450001r0000000000.ovb.gz   h0000500001r0000500000.ovb.gz
h0000150001r0000000000.ovb.gz   h0000300001r0000250000.ovb.gz   h0000450001r0000050000.ovb.gz   h0000550001r0000000000.ovb.gz
h0000150001r0000050000.ovb.gz   h0000300001r0000300000.ovb.gz   h0000450001r0000100000.ovb.gz   h0000550001r0000050000.ovb.gz
h0000150001r0000100000.ovb.gz   h0000350001r0000000000.ovb.gz   h0000450001r0000150000.ovb.gz   h0000550001r0000100000.ovb.gz
h0000150001r0000150000.ovb.gz   h0000350001r0000050000.ovb.gz   h0000450001r0000200000.ovb.gz   h0000550001r0000150000.ovb.gz
h0000200001r0000000000.ovb.gz   h0000350001r0000100000.ovb.gz   h0000450001r0000250000.ovb.gz   h0000550001r0000200000.ovb.gz
h0000200001r0000050000.ovb.gz   h0000350001r0000150000.ovb.gz   h0000450001r0000300000.ovb.gz   h0000550001r0000250000.ovb.gz
h0000200001r0000100000.ovb.gz   h0000350001r0000200000.ovb.gz   h0000450001r0000350000.ovb.gz   h0000550001r0000300000.ovb.gz
h0000200001r0000150000.ovb.gz   h0000350001r0000250000.ovb.gz   h0000450001r0000400000.ovb.gz   h0000550001r0000350000.ovb.gz
h0000200001r0000200000.ovb.gz   h0000350001r0000300000.ovb.gz   h0000450001r0000450000.ovb.gz   h0000550001r0000400000.ovb.gz
h0000250001r0000000000.ovb.gz   h0000350001r0000350000.ovb.gz   h0000500001r0000000000.ovb.gz   h0000550001r0000450000.ovb.gz
h0000250001r0000050000.ovb.gz   h0000400001r0000000000.ovb.gz   h0000500001r0000050000.ovb.gz   h0000550001r0000500000.ovb.gz
h0000250001r0000100000.ovb.gz   h0000400001r0000050000.ovb.gz   h0000500001r0000100000.ovb.gz   h0000550001r0000550000.ovb.gz
h0000250001r0000150000.ovb.gz   h0000400001r0000100000.ovb.gz   h0000500001r0000150000.ovb.gz
h0000250001r0000200000.ovb.gz   h0000400001r0000150000.ovb.gz   h0000500001r0000200000.ovb.gz

By default (saveOverlaps=0) runCA will remove the raw overlapper output in the 0000000001 directory (all those h000*r000*.ovb.gz files) after the ovlStore has been created. If you have these files, the overlap store page describes what you can do with them.

Of importance here is how to restart overlapper after a job fails. We hope that overlapper fails only due to resource limits (out of disk space, out of compute time, out of memory, etc.), but sometimes it simply crashes.

(Aside: To generate this example, I simply deleted one of the overlap output files in the '0000000001' directory before runCA had finished computing all overlaps. After runCA has run each overlap job, it will check the outputs. Since I deleted one of them, it complains that one job failed to finish successfully.)

An example of runCA failing because some overlap jobs failed to finish

% runCA ...
.
.
.
0-overlaptrim-overlap/0000000001/h0000050001r0000050000 failed, job index 3.
================================================================================

runCA failed.

----------------------------------------
Stack trace:

 at /work/wgs/FreeBSD-amd64/bin/runCA line 1121
        main::caFailure('1 overlapper jobs failed', undef) called at /work/wgs/FreeBSD-amd64/bin/runCA line 1271
        main::checkOverlapper('trim') called at /work/wgs/FreeBSD-amd64/bin/runCA line 1309
        main::checkOverlap('trim') called at /work/wgs/FreeBSD-amd64/bin/runCA line 3090
        main::overlapTrim() called at /work/wgs/FreeBSD-amd64/bin/runCA line 5311

----------------------------------------
Failure message:

1 overlapper jobs failed

To rerun the missing/crashed overlap jobs, remove the overlap.sh file.

% rm 0-overlaptrim-overlap/overlap.sh
% runCA ...
----------------------------------------START CONCURRENT Thu Sep  2 22:44:20 2010
0-overlaptrim-overlap/overlap.sh 1 > 0-overlaptrim-overlap/overlap.0001.out 2>&1
0-overlaptrim-overlap/overlap.sh 2 > 0-overlaptrim-overlap/overlap.0002.out 2>&1
.
.
.
.
0-overlaptrim-overlap/overlap.sh 77 > 0-overlaptrim-overlap/overlap.0077.out 2>&1
0-overlaptrim-overlap/overlap.sh 78 > 0-overlaptrim-overlap/overlap.0078.out 2>&1
----------------------------------------END CONCURRENT Thu Sep  2 22:44:22 2010 (2 seconds)
----------------------------------------START Thu Sep  2 22:44:22 2010
find 0-overlaptrim-overlap -follow \( -name \*ovb.gz -or -name \*ovb \) -print > 0-overlaptrim/dissect.obtStore.list
----------------------------------------END Thu Sep  2 22:44:22 2010 (0 seconds)
----------------------------------------START Thu Sep  2 22:44:22 2010
/work/wgs/FreeBSD-amd64/bin/overlapStore  -O  -c 0-overlaptrim/dissect.obtStore.BUILDING  -g dissect.gkpStore  -M 1024 -L 0-overlaptrim/dissect.obtStore.list > 0-overlaptrim/dissect.obtStore.err 2>&1
----------------------------------------END Thu Sep  2 22:44:40 2010 (18 seconds)

Generate an overlap store for trimming

find 0-overlaptrim-overlap \
  -follow \( -name \*ovb.gz -or -name \*ovb \) -print \
  > 0-overlaptrim/dissect.obtStore.list

bin/overlapStore \
  -O \
  -c 0-overlaptrim/dissect.obtStore.BUILDING \
  -g dissect.gkpStore \
  -M 1024 -L 0-overlaptrim/dissect.obtStore.list \
  > 0-overlaptrim/dissect.obtStore.err 2>&1

The first command finds all the overlap output files, and the second builds an overlap store from them. The -O option tells [[ovlStore|overlapStore]] to retain only overlaps useful for trimming. The overlapStore command is also used to retrieve overlaps from the obtStore.

Generate an overlap store for detecting duplicate reads

find 0-overlaptrim-overlap \
  -follow \( -name \*ovb.gz -or -name \*ovb \) -print \
  > 0-overlaptrim/dissect.dupStore.list
bin/overlapStore \
 -O -O \
 -c 0-overlaptrim/dissect.dupStore.BUILDING \
 -g dissect.gkpStore \
 -M 1024 \
 -L 0-overlaptrim/dissect.dupStore.list \
 > 0-overlaptrim/dissect.dupStore.err 2>&1

Very similar to the previous overlapStore command, except this one uses an undocumented double negative '-O -O' option that tells it to retain only overlaps useful for duplicate detection. The overlapStore command is also used to retrieve overlaps from the store.

Detect duplicated reads

bin/deduplicate \
  -gkp     dissect.gkpStore \
  -ovs     0-overlaptrim/dissect.obtStore \
  -ovs     0-overlaptrim/dissect.dupStore \
  -report  0-overlaptrim/dissect.deduplicate.report \
  -summary 0-overlaptrim/dissect.deduplicate.summary \
  > 0-overlaptrim/dissect.deduplicate.err \
  2>&1

The 'deduplicate' program examines overlaps in both the obtStore and dupStore to detect probable duplicate fragments and duplicate (454) circles. It logs duplicates in the 'report' file and summarizes the results in the 'summary' file.

0-overlaptrim/dissect.deduplicate.summary

Checking library E8YURXS0 for duplicates.
duplicateFrags:    8506
duplicateMates:    781

0-overlaptrim/dissect.deduplicate.report

Delete E8YURXS01ARY0J,74 DUPof E8YURXS02GN8AR,125092 (a 0,154  b 0,155  hang 0,1  diff 0,0  error 0.006500
Delete E8YURXS01BJVOM,201 DUPof E8YURXS01BTRNL,108648 (a 0,264  b 0,265  hang 0,1  diff 0,0  error 0.003800
Delete E8YURXS01EFXBW,205 DUPof E8YURXS02HBAMQ,523333 (a 0,66  b 0,66  hang 0,0  diff 0,0  error 0.000000
Delete E8YURXS01A3Y4G,245 DUPof E8YURXS01AXGBK,27477 (a 0,257  b 0,258  hang 0,1  diff 0,0  error 0.007800
Delete E8YURXS01DF2QR,264 DUPof E8YURXS01EXUCE,118180 (a 0,237  b 0,239  hang 0,2  diff 0,0  error 0.008400
Delete E8YURXS01BMH8I,354 DUPof E8YURXS01B3VTF,4073 (a 0,260  b 0,260  hang 0,0  diff 0,0  error 0.003800
Delete E8YURXS01EEQA4,570 DUPof E8YURXS02JPL27,186919 (a 0,284  b 0,286  hang 0,2  diff 0,0  error 0.007000
Delete E8YURXS01EOV04,634 DUPof E8YURXS01C5HOQ,8398 (a 0,228  b 0,229  hang 0,1  diff 0,0  error 0.004400
Delete E8YURXS01B8S6B,682 DUPof E8YURXS01CD83G,16335 (a 0,210  b 0,210  hang 0,0  diff 0,0  error 0.000000
Delete E8YURXS01EDA0F,697 DUPof E8YURXS01ATCTS,7246 (a 0,267  b 0,267  hang 0,0  diff 0,0  error 0.003700
.
.
.
Delete E8YURXS01D2S0Oa,250431 <-> E8YURXS01D2S0Ob,250432 DUPof E8YURXS01D22H1a,399766 <-> E8YURXS01D22H1b,399767
Delete E8YURXS01EH0NAa,250503 <-> E8YURXS01EH0NAb,250504 DUPof E8YURXS01EESFMa,380902 <-> E8YURXS01EESFMb,380903
Delete E8YURXS01D9XCJa,250792 <-> E8YURXS01D9XCJb,250793 DUPof E8YURXS02HNZM7a,403377 <-> E8YURXS02HNZM7b,403378
Delete E8YURXS01D8PMEa,251273 <-> E8YURXS01D8PMEb,251274 DUPof E8YURXS01B1J5Ba,286677 <-> E8YURXS01B1J5Bb,286678
Delete E8YURXS01BJPH2a,251348 <-> E8YURXS01BJPH2b,251349 DUPof E8YURXS02JP9JOa,405784 <-> E8YURXS02JP9JOb,405785
Delete E8YURXS01EVJJ2a,251381 <-> E8YURXS01EVJJ2b,251382 DUPof E8YURXS02JNYGBa,552774 <-> E8YURXS02JNYGBb,552775
Delete E8YURXS01ES5BAa,251582 <-> E8YURXS01ES5BAb,251583 DUPof E8YURXS01C69ELa,375442 <-> E8YURXS01C69ELb,375443
Delete E8YURXS01BGABLa,251851 <-> E8YURXS01BGABLb,251852 DUPof E8YURXS01A7FWYa,312305 <-> E8YURXS01A7FWYb,312306
Delete E8YURXS01AH8A9a,251942 <-> E8YURXS01AH8A9b,251943 DUPof E8YURXS01CFLGNa,283926 <-> E8YURXS01CFLGNb,283927
Delete E8YURXS01CGJKSa,252400 <-> E8YURXS01CGJKSb,252401 DUPof E8YURXS01COBBOa,307494 <-> E8YURXS01COBBOb,307495
.
.
.

Duplicates are listed in no specific order.
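The fragment-duplicate lines above can be pulled apart with a small regular expression. A sketch assuming the layout shown; the mate-pair '<->' lines use a different layout and are deliberately skipped:

```python
import re

# matches the single-fragment 'Delete ... DUPof ... error ...' lines above
FRAG_DUP = re.compile(r"Delete (\S+),(\d+) DUPof (\S+),(\d+) .*error (\d+\.\d+)")

def parse_dup_report(lines):
    """Yield (deleted_uid, kept_uid, error_rate) for fragment duplicates."""
    for line in lines:
        m = FRAG_DUP.match(line)
        if m:
            yield m.group(1), m.group(3), float(m.group(5))

line = ("Delete E8YURXS01ARY0J,74 DUPof E8YURXS02GN8AR,125092 "
        "(a 0,154  b 0,155  hang 0,1  diff 0,0  error 0.006500")
print(list(parse_dup_report([line])))
# [('E8YURXS01ARY0J', 'E8YURXS02GN8AR', 0.0065)]
```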

Trim reads using overlaps

bin/consolidate \
 -ovs 0-overlaptrim/dissect.obtStore \
 > 0-overlaptrim/dissect.ovl.consolidated \
2> 0-overlaptrim/dissect.ovl.consolidated.err
bin/merge-trimming \
  -log 0-overlaptrim/dissect.mergeLog \
  -frg dissect.gkpStore \
  -ovl 0-overlaptrim/dissect.ovl.consolidated \
  > 0-overlaptrim/dissect.merge.err 2>&1

The 'consolidate' command reads all the overlaps useful for trimming (from the obtStore), computes several statistics and writes output to 'dissect.ovl.consolidated' (a text file useful only to 'merge-trimming').

The 'merge-trimming' command processes the 'ovl.consolidated' file and uses several heuristics to compute a clear range for each fragment. Some developer-useful statistics are presented in 'dissect.mergeLog.stats', and a log of the clear ranges is in 'dissect.mergeLog'.

dissect.mergeLog

E8YURXS01ESZ1R,1        24      248     0       248
E8YURXS01AVULC,2        24      95      0       95
E8YURXS01DYACD,3        24      302     0       302
E8YURXS01AZQ3U,4        24      114     19      114
E8YURXS01CZHP2,5        24      56      0       0 (deleted, too short)
E8YURXS01EMQU0,6        24      254     0       254
E8YURXS01EKPUL,7        24      251     0       251
E8YURXS01BX0B9,9        38      289     38      289
E8YURXS01EG20N,10       60      103     0       0 (deleted, too short)

The first pair of numbers is the 'initial' clear range (label OBTINITIAL), and the second pair is the final 'obt' clear range (label OBTMERGE). If the read is deleted, the end of the line is annotated with a description of why.
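A sketch of a parser for this layout (whitespace-delimited, with the optional deletion note kept as trailing text; our helper, not part of the assembler):

```python
def parse_merge_log(line):
    """Parse one mergeLog line into (uid, iid, initial_clear, final_clear,
    note). 'initial' is the OBTINITIAL range, 'final' the OBTMERGE range."""
    f = line.split(None, 5)
    uid, iid = f[0].split(",")
    initial = (int(f[1]), int(f[2]))
    final = (int(f[3]), int(f[4]))
    note = f[5] if len(f) > 5 else None
    return uid, int(iid), initial, final, note

print(parse_merge_log("E8YURXS01CZHP2,5  24  56  0  0 (deleted, too short)"))
# ('E8YURXS01CZHP2', 5, (24, 56), (0, 0), '(deleted, too short)')
```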

Detect chimeric and spur reads

bin/chimera \
 -gkp dissect.gkpStore \
 -ovs 0-overlaptrim/dissect.obtStore \
 -summary 0-overlaptrim/dissect.chimera.summary \
 -report  0-overlaptrim/dissect.chimera.report \
 > 0-overlaptrim/dissect.chimera.err 2>&1

The final step in Overlap Based Trimming is to examine overlaps for signals of chimeric or spur fragments. A chimeric fragment is formed from two disjoint pieces of genomic DNA, and a spur fragment has high-quality non-genomic DNA at either end.

It writes a 'summary' statistics file and a detailed 'report' of the chimeras and spurs detected. Each detected chimera or spur is either corrected by trimming, or the fragment is deleted from the assembly.

0-overlaptrim/dissect.chimera.summary

EXCEPT for the number of things fixed or deleted, these stats
are not meaningful, and likely wrong.

READS
total processed       469659
full coverage         175
gap not chimera       0
no chimeric ovl       466759

chimera fixed:        2638
chimera deleted small 60

spur fixed            21
spur deleted small    6

CHIMERA
from innie pair       2698
from overhang         0
from gap              0
from linker           0

0-overlaptrim/dissect.chimera.report

E8YURXS01EF8XF,14 CHIMERA Trimmed from    0  281 to  126  281.  Length OK, gatekeeper store updated.
E8YURXS01DW0YQ,19 CHIMERA Trimmed from    0  265 to   75  265.  Length OK, gatekeeper store updated.
.
.
E8YURXS01BIEY4,2184 CHIMERA Trimmed from    0  137 to    0   50.  New length too small, fragment deleted, gatekeeper store updated.
.
.
E8YURXS01EVHR8,3610 CHIMERA Trimmed from    0  262 to  122  262.  Length OK, gatekeeper store updated.
E8YURXS01D4316,3646 SPUR Trimmed from    0  272 to    0  219.  Length OK, gatekeeper store updated.
E8YURXS01EXMIY,3666 CHIMERA Trimmed from    0  269 to    0  103.  Length OK, gatekeeper store updated.
.
.
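Since the summary warns that most of its numbers are unreliable, it can be handy to recount outcomes directly from the report. A sketch assuming the line format shown above (type in the second column, the word 'deleted' appearing in the trailing description):

```python
from collections import Counter

def tally_chimera_report(lines):
    """Count (type, fate) pairs from chimera report lines."""
    c = Counter()
    for line in lines:
        kind = line.split()[1]                        # CHIMERA or SPUR
        fate = "deleted" if "deleted" in line else "fixed"
        c[kind, fate] += 1
    return c

report = [
    "E8YURXS01EF8XF,14 CHIMERA Trimmed from    0  281 to  126  281.  Length OK, gatekeeper store updated.",
    "E8YURXS01BIEY4,2184 CHIMERA Trimmed from    0  137 to    0   50.  New length too small, fragment deleted, gatekeeper store updated.",
    "E8YURXS01D4316,3646 SPUR Trimmed from    0  272 to    0  219.  Length OK, gatekeeper store updated.",
]
print(tally_chimera_report(report))
```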
stopAfter = OBT

Overlap

Run overlap jobs

1-overlapper/overlap.sh 1 > 1-overlapper/overlap.0001.out 2>&1
.
.
.

This stage is nearly identical to 0-overlaptrim-overlap, except that it runs in the '1-overlapper' directory and uses a different clear range.

Build an overlap store for assembly

find 1-overlapper \( -name \*ovb.gz -or -name \*ovb \) -print > dissect.ovlStore.list
bin/overlapStore \
  -c dissect.ovlStore.BUILDING \
  -g dissect.gkpStore \
  -i 0 -M 1024 -L dissect.ovlStore.list \
  > dissect.ovlStore.err 2>&1
stopAfter = overlapper

Overlap error correction

Overlap error correction uses a two-pass algorithm to correct the error rate reported in overlaps for sequencing errors.

The first pass, fragment error correction, reads all overlaps for a specific fragment, generates a multialignment (very similar to an Overlap Store Picture), then analyzes the multialignment for patterns that indicate a sequencing error. Any errors found are saved to disk.

The second pass, overlap error correction, recomputes each overlap after correcting the fragments for these errors. Only the error rate of the overlap is changed; in particular, the changes made to fragments are NOT saved.

There are no user-viewable files generated in this stage.

Examine fragment multialigns for errors

3-overlapcorrection/frgcorr.sh 1 > 3-overlapcorrection/0001.err 2>&1
3-overlapcorrection/frgcorr.sh 2 > 3-overlapcorrection/0002.err 2>&1
bin/cat-corrects \
  -L 3-overlapcorrection/cat-corrects.frgcorrlist \
  -o 3-overlapcorrection/dissect.frgcorr \
  > 3-overlapcorrection/cat-corrects.err 2>&1

Apply error corrections to overlaps

3-overlapcorrection/ovlcorr.sh 1 > 3-overlapcorrection/0001.err 2>&1
3-overlapcorrection/ovlcorr.sh 2 > 3-overlapcorrection/0002.err 2>&1
bin/cat-erates \
  -L 3-overlapcorrection/cat-erates.eratelist \
  -o 3-overlapcorrection/dissect.erates \
  > 3-overlapcorrection/cat-erates.err 2>&1

Update the overlap store with corrected error rates

bin/overlapStore \
  -u dissect.ovlStore \
  3-overlapcorrection/dissect.erates \
  > 3-overlapcorrection/overlapStore-update-erates.err 2>&1

Construct unitigs

bin/buildUnitigs \
  -O dissect.ovlStore \
  -G dissect.gkpStore \
  -T dissect.tigStore \
  -B 150000 \
  -e 0.03 \
  -E 0 \
  -b \
  -m 7 \
  -U \
  -o 4-unitigger/dissect > 4-unitigger/unitigger.err 2>&1
stopAfter = unitigger

The BOG unitigger writes several interesting files: the complete best overlap graph, overlaps that intersected unitigs, overlaps that were not used in any unitig, a collection of statistics, and diagnostic output.

The best overlap graph is contained in two files, one showing the best dovetail overlaps, the other showing the best containment overlaps.

best.contains

#fragId libId   mated   bestCont
2       1       f       9065
4       1       f       5311
9       1       f       182970
11      1       f       32029
14      1       f       4505
19      1       f       17586
20      1       f       30626
21      1       f       22682
22      1       f       39583

The best.contains file tells us that fragment IID 2 in library IID 1 is not mated, and is contained in fragment 9065.

best.edges

#fragId libId   mated   best5   best3
1       1       142394  5'      183004  3'
3       1       29299   5'      71998   3'
6       1       68447   3'      86073   3'
7       1       199988  3'      54667   5'
12      1       29021   3'      174989  5'
18      1       173870  3'      0       5'
23      1       82096   3'      75807   5'
26      1       336148  3'      169224  3'
33      1       4232    5'      2236    3'

(The header is incorrect. There is no 'mated' column in this data.)

The best.edges file tells us that fragment IID 1 in library IID 1 has a best edge off of its 5' end to the 5' end of fragment 142394, and a best edge off of its 3' end to the 3' end of fragment 183004:

<----------142394---------
           ------------1------------>
                               <-------183004-----------
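A partner IID of 0 (as on fragment 18 above) appears to mean there is no best edge off that end; assuming that reading, a one-line awk flags such fragments, which tend to sit at unitig ends. The sample file is invented data in the same layout:

```shell
cat > best.edges.sample <<'EOF'
#fragId libId   best5           best3
1       1       142394  5'      183004  3'
18      1       173870  3'      0       5'
EOF

# Columns: fragId, libId, best5 partner, best5 end, best3 partner, best3 end.
# Print fragments whose 5' or 3' partner is 0 (no best edge on that end).
awk '!/^#/ && ($3 == 0 || $5 == 0) { print $1 }' best.edges.sample
```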

dissect.breaks.ovl

{OVL
afr:48281
bfr:37794
ori:N
olt:D
ahg:9
bhg:3
qua:0.000000
mno:0
mxo:0
pct:0
}
.
.
.

File dissect.breaks.ovl contains a list of overlaps (in OVL message format, which will soon change) that caused a unitig intersection break.
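The OVL records are simple to tally with awk; as a sketch (the sample file is invented, in the record format shown above):

```shell
cat > breaks.sample.ovl <<'EOF'
{OVL
afr:48281
bfr:37794
ori:N
}
{OVL
afr:48281
bfr:12345
ori:I
}
EOF

# Count OVL records, and the distinct A-fragments involved in breaks.
awk '/^\{OVL/ { n++ }
     /^afr:/  { id = substr($0, 5); if (!(id in seen)) { seen[id] = 1; d++ } }
     END      { print n " overlaps, " d " distinct A-fragments" }' breaks.sample.ovl
```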

File dissect.cga.0 contains several length and coverage statistics.

File dissect.iidmap contains a mapping from a BOG-internal unitig ID to the assembler IID. It is used for associating unitigs in consensus and CGW with unitigs in the unitigger.err diagnostic output.

Files dissect.partitioning and dissect.partitioningInfo contain the gatekeeper store partitioning used when computing consensus sequences. The next step in the pipeline will use these files to rewrite the gkpStore for easier loading by utgcns.

File dissect.unused.ovl is a list of overlaps (in OVL message format, which will soon change) that are 'best overlaps' but not used in any unitig.

File unitigger.err is the diagnostic output of the unitigger. Most of this output is disabled by default.

File unitigger.success informs runCA that the unitigger phase of the assembly has finished successfully.

Compute unitig consensus sequences

Partition the fragments for an efficient compute

bin/gatekeeper -P 4-unitigger/dissect.partitioning dissect.gkpStore \
  > 5-consensus/dissect.partitioned.err 2>&1

This reads a partitioning file (here, 4-unitigger/dissect.partitioning) and rewrites the fragment information in dissect.gkpStore to increase locality. All fragments needed by a single unitig consensus job are placed in a single partition file.

Compute Consensus Sequences

5-consensus/consensus.sh 1 > /dev/null 2>&1
5-consensus/consensus.sh 2 > /dev/null 2>&1

Compute consensus sequences for a set of unitigs. Each shell script computes one consensus partition. Each partition writes a diagnostic output file, "dissect_###.err", and, if it finishes successfully, "dissect_###.success". Errors, if any, are reported in the diagnostic output. See # Unitig Consensus Failures in CA 6 for details.
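The success sentinels make it easy to check by hand that every partition finished. A sketch, using scratch files in place of the real 5-consensus directory:

```shell
# Work in a scratch directory with stand-ins for the real outputs.
cd "$(mktemp -d)"
touch dissect_0001.err dissect_0001.success dissect_0002.err
# (partition 2 has no .success, simulating a failed job)

# Flag any partition whose .err file lacks a matching .success sentinel.
failed=0
for err in dissect_*.err; do
  ok="${err%.err}.success"
  [ -e "$ok" ] || { echo "partition failed: $err"; failed=1; }
done
echo "failed=$failed"
```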

stopAfter = utgcns


Scaffolding

Estimate insert sizes

On smaller assemblies, an initial scratch scaffolding is computed to estimate the insert sizes. The scaffolding is discarded, and the insert sizes are updated in the gatekeeper store. This step can be enabled on large assemblies, or disabled on small assemblies, with the computeInsertSize option.

bin/cgw  \
  -j 1 -k 5 \
  -r 5 \
  -s 2 \
  -S 0 \
  -G \
  -z \
  -P 2 \
  -B 150000 \
  -m 100 \
  -g dissect.gkpStore \
  -t 6-clonesize/dissect.tigStore \
  -o 6-clonesize/dissect \
  > 6-clonesize/cgw.out 2>&1

The insert size estimates, and the actual values used to generate those estimates, are in 6-clonesize/stat/*distlib*cgm. Four estimates are generated: two at the start of the program, based on unitigs, and two at the end, based on contigs and scaffolds.

  • 6-clonesize/stat/unitig_preinitial.distlib_1.cgm: estimates from unitigs, before splitting potentially chimeric unitigs
  • 6-clonesize/stat/unitig_initial.distlib_1.cgm: estimates from unitigs, after splitting potentially chimeric unitigs
  • 6-clonesize/stat/scaffold_final.distlib_1.cgm: estimates from scaffolds
  • 6-clonesize/stat/contig_final.distlib_1.cgm: estimates from contigs

The first line of each file contains the estimate; the remaining lines are distances sorted from shortest to longest.

lib 1 mu 3178.58 sigma 885.224
1252
1283
1289
1295
1298
1313
1372
1379
.
.
.

The same estimates are also in 6-clonesize/stat/*distupdate.dst. These are used by gatekeeper to update the gkpStore. In this *distupdate.dst file, the estimate went from 2000 +- 666 to 1193 +- 243:

# LIB T13146 2000.000000 +- 666.000000 -> 1193.902002 +- 243.667820
{LIB
act:U
acc:T13146
mea:1193.902
std:243.668
}
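These LIB messages can be pulled apart with a short awk; a sketch, assuming the acc/mea/std field layout shown (the sample is a copy of the record above):

```shell
cat > distupdate.sample.dst <<'EOF'
# LIB T13146 2000.000000 +- 666.000000 -> 1193.902002 +- 243.667820
{LIB
act:U
acc:T13146
mea:1193.902
std:243.668
}
EOF

# Split each line on ':' and print "accession mean stddev" per LIB message.
awk -F: '$1 == "acc" { acc = $2 }
         $1 == "mea" { mea = $2 }
         $1 == "std" { print acc, mea, $2 }' distupdate.sample.dst
```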

For an estimate to be made, there must be at least 100 (default) mate pairs in the object being sampled (unitigs, contigs or scaffolds). This value can be modified using the runCA option cgwDistanceSampleSize.

The scaffold-based estimates are less precise than the contig-based estimates, but the contig estimates might not have enough data to be generated. Gatekeeper updates with the scaffold-based estimate first, then, if one exists, with the contig-based estimate.

bin/gatekeeper \
  -a -o dissect.gkpStore \
  -E 6-clonesize/gkp.distupdate.err \
  6-clonesize/stat/scaffold_final.distupdate.dst \
  6-clonesize/stat/contig_final.distupdate.dst \
  > 6-clonesize/cgw.distupdate.err 2>&1

Scaffolding

By default, there are five phases to scaffolding. Here, two have been disabled by setting doExtendClearRanges = 1.

bin/cgw  \
  -j 1 -k 5 \
  -r 5 \
  -s 0 \
  -S 0 \
  -G \
  -z \
  -P 2 \
  -B 50000 \
  -m 100 \
  -g dissect.gkpStore \
  -t dissect.tigStore \
  -o 7-0-CGW/dissect \
  > 7-0-CGW/cgw.out 2>&1

Like in 6-clonesize, the cgw program estimates insert sizes in 7-0-CGW/stat/*dist*.

The sizes of scaffolds generated are reported in final0.SingleScaffolds.nodelength.cgm (for scaffolds with a single contig) and final0.Scaffolds.nodelength.cgm. Using awk we can determine the approximate N50 value.

First, we compute the sum of scaffold lengths and divide that by two.

% cat final0.SingleScaffolds.nodelength.cgm \
      final0.Scaffolds.nodelength.cgm | \
      sort -nr | \
      awk '{ sum += $1 } END { print sum/2 }'
1.12795e+06

Then scan the list for the first length that exceeds that.

% cat final0.SingleScaffolds.nodelength.cgm \
      final0.Scaffolds.nodelength.cgm | \
      sort -nr | \
      awk '{ cnt++; sum += $1; if (sum > 1.12795e+06) { print cnt,$1,sum; exit } }'
11 33009 1160469

The N50 value is 33,009 bases, and we needed 11 scaffolds to get there. Removing the 'if' test will show all values:

% cat final0.SingleScaffolds.nodelength.cgm \
      final0.Scaffolds.nodelength.cgm | \
      sort -nr | awk '{ cnt++; sum += $1; print cnt,$1,sum; }' | \
      head -20
1 745056 745056
2 77799 822855
3 42417 865272
4 40155 905427
5 39104 944531
6 38716 983247
7 38691 1021938
8 37071 1059009
9 35122 1094131
10 33329 1127460
11 33009 1160469
12 32145 1192614
13 27457 1220071
14 26979 1247050
15 26167 1273217
16 26107 1299324
17 25969 1325293
18 24570 1349863
19 22971 1372834
20 22640 1395474
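The two passes above can also be collapsed into a single awk invocation that buffers the sorted lengths, then scans until the running sum passes half the total. A sketch, on a small invented length list:

```shell
cat > lengths.sample <<'EOF'
100
90
50
30
20
10
EOF

# Single-pass N50: store sorted lengths, then scan to the halfway point.
sort -nr lengths.sample |
  awk '{ a[NR] = $1; sum += $1 }
       END { half = sum / 2; run = 0
             for (i = 1; i <= NR; i++) {
               run += a[i]
               if (run > half) { print "N50", a[i], "at rank", i; exit }
             } }'
```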

After cgw generates contigs and scaffolds, we try to close intra-scaffold gaps by extending the clear range of some fragments through the gap. The eCR program also more aggressively closes short gaps.

The eCR program, like cgw before it, is expensive and not parallel: it examines every gap in every scaffold. Instead of one large batch, the computation is broken into (by default) four batches. At the end of each batch, the work is saved in a 'checkpoint' file. extendClearRangesPartition examines the scaffolds to determine how many scaffolds go into each batch.

bin/extendClearRangesPartition \
  -g dissect.gkpStore \
  -t dissect.tigStore \
  -n 7 \
  -c dissect \
  -N 4 \
  -p 7-1-ECR/extendClearRanges.partitionInfo \
  > 7-1-ECR/extendClearRanges.partitionInfo.err 2>&1

Even though we requested four batches, only two were generated.

7-1-ECR/extendClearRanges-scaffold.0000017.sh
7-1-ECR/extendClearRanges-scaffold.0000388.sh

Each batch writes diagnostic output to files of similar name, in this case, extendClearRanges-scaffold.0000017.err and extendClearRanges-scaffold.0000388.err. At the end of each diagnostic file are statistics on the number and size of gaps closed.

% cat extendClearRanges-scaffold.0000017.err
.
.
.
                  numGaps: 19
        numExtendableGaps: 19 (left: 0, right: 1, both: 0)
            numGapsClosed: 7
             numSmallGaps: 12 (closed 7, 58.33%)
             numLargeGaps: 7 (closed 0, 0.00%)
                  allGaps: 19 (closed 7, 36.84%)
         maxGapSizeClosed: 0.00 (gap number -1) 
       numGapsVarTooSmall: 0
        numClosingsTried : 21
unitigMultiAlignFailures : 0
replaceEndUnitigFailures : 0
   unitigToContigFailures: 0
    createAContigFailures: 0
           noOverlapFound: 14
            avgOlapLength: 188.57
         stddevOlapLength: nan
             avgOlapDiffs: 0.14
          surrogatesOnEnd: 0

             CONTIG_BASES: 1000
   totalContigsBaseChange: -1320
   totalBasesInClosedGaps: -1320
     scaffold base change: 0

In this batch, there were 19 intra-scaffold gaps, all of which met the criteria for having at least one fragment extension tested. 12 of the gaps were small (hardcoded, less than 100bp) and 7 were large. Of the 12 small gaps, 7 were closed.

The number of closings tried is more than the number of gaps because there are up to four ways to close any gap: extending a fragment from the contig on the left, extending a fragment from the contig on the right, extending both fragments, or extending no fragment.
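Across batches, the closure statistics can be totaled directly from the diagnostic files; a sketch, with two stand-in .err files mimicking the statistics block above:

```shell
mkdir -p ecr.sample
printf '                  numGaps: 19\n            numGapsClosed: 7\n' \
  > ecr.sample/extendClearRanges-scaffold.0000017.err
printf '                  numGaps: 11\n            numGapsClosed: 4\n' \
  > ecr.sample/extendClearRanges-scaffold.0000388.err

# Sum gap totals across all eCR batch diagnostics.
awk '$1 == "numGaps:"       { gaps   += $2 }
     $1 == "numGapsClosed:" { closed += $2 }
     END { printf "%d of %d gaps closed\n", closed, gaps }' \
  ecr.sample/extendClearRanges-scaffold.*.err
```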

When a gap is closed, two contigs are merged into one. We can count the number of contigs in each version of the tigStore (the tigStore version numbers are the '-n' option to extendClearRanges, in each of the extendClearRanges*.sh files).

% tigStore -g test.gkpStore -t test.tigStore 7 -D contiglist | wc -l
    7624
% tigStore -g test.gkpStore -t test.tigStore 8 -D contiglist | wc -l
    7631
% tigStore -g test.gkpStore -t test.tigStore 9 -D contiglist | wc -l
    7645

Version 7 is before the first batch of eCR. Version 8 is after the first batch, and we see 7 contigs were added -- exactly what was reported above in 'extendClearRanges-scaffold.0000017.err'.

After 'extendClearRanges', we again rebuild scaffolds.

As this is the last 'cgw' invocation, it also 'throws stones' into gaps, and places any unambiguous fragments from surrogates into the scaffold. It also creates a partitioned tigStore for input to consensus.

Under default parameters, this would be the '7-4-CGW' step, not '7-2-CGW' as here. We limited this assembly to 'one round' of extendClearRanges instead of the normal two.

bin/cgw -R 10 -N ckp01-ABS \
  -j 1 -k 5 \
  -r 5 \
  -s 2 \
  -z \
  -P 2 \
  -B 50000 \
  -m 100 \
  -g dissect.gkpStore \
  -t dissect.tigStore \
  -o 7-2-CGW/dissect \
  > 7-2-CGW/cgw.out 2>&1
stopAfter = scaffolder

Compute contig consensus sequences

As with unitig consensus, we partition the gkpStore to localize the fragments in each contig.

Partition fragments

bin/gatekeeper -P 7-CGW/dissect.partitioning dissect.gkpStore \
  > 8-consensus/dissect.partitioned.err 2>&1

Compute consensus sequences

8-consensus/consensus.sh 1 > /dev/null 2>&1
8-consensus/consensus.sh 2 > /dev/null 2>&1
stopAfter = ctgcns

Terminate

After consensus sequences are computed, all that is left to do is write the outputs. The 'terminator' command reads the gkpStore, tigStore and final CGW checkpoint (soon to be the scfStore) and outputs the ASM file and a mapping from the internal IID (as used during assembly) to the external UID (as listed in the ASM file) for each fragment, unitig, contig and scaffold.

bin/terminator \
  -g dissect.gkpStore \
  -t dissect.tigStore 20 \
  -c 7-CGW/dissect 19 \
  -o 9-terminator/dissect \
  > 9-terminator/dissect.asm.err

FASTA files are generated from the ASM file; singletons -- not listed in the ASM file -- are discovered from the tigStore and CGW checkpoint.

bin/asmOutputFasta \
  -p 9-terminator/dissect 9-terminator/dissect.asm \
  > 9-terminator/asmOutputFasta.err 2>&1
bin/dumpSingletons \
  -g dissect.gkpStore \
  -t dissect.tigStore \
  -c 7-CGW/dissect \
  -n 19 \
  -S \
  > 9-terminator/dissect.singleton.fasta \
  2> 9-terminator/dumpSingletons.err 

The POSMAP files are generated, also from the ASM file.

bin/buildPosMap -o dissect \
  < 9-terminator/dissect.asm \
  > 9-terminator/buildPosMap.err 2>&1

Finally, the QC metrics are generated. At the bottom of the metrics is a histogram of scaffold depth, generated from the POSMAP.

sort -k2n -k3n \
  -T 9-terminator 9-terminator/dissect.posmap.frgscf \
  > 9-terminator/dissect.posmap.frgscf.sorted \
&& \
bin/fragmentDepth \
  -min       0 -max    3000 \
  < 9-terminator/dissect.posmap.frgscf.sorted \
  > 9-terminator/dissect.posmap.frgscf.histogram1 \
&& \
bin/fragmentDepth \
  -min    3001 -max   10000 \
  < 9-terminator/dissect.posmap.frgscf.sorted \
  > 9-terminator/dissect.posmap.frgscf.histogram2 \
&& \
bin/fragmentDepth \
  -min   10001 -max 1000000 \
  < 9-terminator/dissect.posmap.frgscf.sorted \
  > 9-terminator/dissect.posmap.frgscf.histogram3 \
&& \
bin/fragmentDepth -min 1000001 \
  < 9-terminator/dissect.posmap.frgscf.sorted \
  > 9-terminator/dissect.posmap.frgscf.histogram4 
/usr/bin/env perl bin/caqc.pl -euid  9-terminator/dissect.asm