Version 7.0 Changes

This is a nearly complete list of the changes made in Celera Assembler 7.0 since the last release (Celera Assembler 6.0). See also Version 7.0 Release Notes.

BROKEN COMMITS

(none known yet)

WGS-ASSEMBLER CHANGES

This list is derived from the CVS commit logs. Some logs are omitted (those describing code changes with no real impact). This list is technical. For a more readable list, see Version 7.0 Release Notes.

  1. 2010/06/02 -- (gatekeeper) - Fix crash when reads longer than 2048 bases were encountered.
  2. 2010/06/17 -- (sffToCA) - Fix crash when '-clear pair-of-n' encountered a read starting with a pair of N's. SF bug 3017755.
  3. 2010/06/18 -- (BOG) - Reduce frequency of crashing during mate based splitting on hybrid Illumina assemblies. During splitting, try to split only on non-contained fragments.
  4. 2010/07/01 -- (convert-fasta-to-v2) - Update flags when specifying 454 reads to those output by sffToCA.
  5. 2010/07/12 -- (merTrim) - Avoid going out of bounds when setting the start of the clear range on a fragment.
  6. 2010/07/13 -- (gatekeeper) - Truncate Illumina fragments to maximum length if they exceed it.
  7. 2010/08/06 -- (overlapStore) - Fix crash when there are fewer overlaps than fragments.
  8. 2010/08/06 -- (terminator) - Allow disabling of fakeUIDs. SF bug #3018126.
  9. 2010/08/06 -- (tigStore) - Don't attempt to print empty multialigns.
  10. 2010/08/11 -- (runCA) - Fix 'false crash' when checking if olap-from-seeds jobs finished. Even though all jobs finished, runCA would think some jobs failed.
  11. 2010/08/16 -- (general) - Fix problems reading large messages. This typically showed up as a crash in the post-terminator steps on large assemblies with deep unitig coverage.
  12. 2010/09/03 -- (convertOverlap) - Simplify usage; command line options changed.
  13. 2010/09/07 -- (gatekeeper) - When dumping FastQ, fix a bug where both mated reads were labeled 'dir=F'. Now, one is F and the other is R, though the label is arbitrary.
  14. 2010/09/20 -- (tigStore) - Add option -B to build a new tigStore from a layout dump.
  15. 2010/09/23 -- (BOG) - Bug in computing the bad mate interval. This changes mate-based splitting.
  16. 2010/09/23 -- (CGW) - Check that all fragments are present in unitigs and that mates are also present when adding a graph edge.
  17. 2010/09/23 -- (utgcns) - Check that option '-t name vers part' has valid 'vers' and 'part' values.
  18. 2010/09/24 -- (BOG,CGW) - Be more strict when assigning output unitigs to consensus partitions.
  19. 2010/09/24 -- (BOG) - Change in filtering of intersection break points.
  20. 2010/09/25 -- (BOG) - The length of a cycle in the Best Overlap Graph was off by one. Add debug output of the BOG path.
  21. 2010/09/28 -- (BOG) - Mate based splitting changes due to bug fix.
  22. 2010/09/28 -- (BOG) - Add -D debug output flags.
  23. 2010/09/28 -- (overlapStore) - Add -G option to implement an old algorithm for computing the size of the genome based on overlaps.
  24. 2010/09/28 -- (BOG) - More aggressively pop intersection bubbles.
  25. 2010/09/29 -- (BOG) - An incorrect test ('fragId' instead of 'fragId,fragEnd') was leading to incorrect path lengths in some unitigs with intersections.
  26. 2010/09/30 -- (gatekeeper) - Allow blank lines in the input frg file.
  27. 2010/10/01 -- (BOG) - Skip fragments in non-existent library 0.
  28. 2010/10/01 -- (BOG) - Fix memory leak during output of final unitigs.
  29. 2010/10/01 -- (BOG) - Fix memory leak during unitig construction.
  30. 2010/10/01 -- (overlapStore) - Allow any number of input overlap files. Previously limited to about 10,000.
  31. 2010/10/01 -- (runCA) - Rename intermediate overlap file names from 'h####r####' to be named after the job index.
  32. 2010/10/04 -- (BOG) - Correct a flaw in the placement of fragments. This exhibited itself only during bubble popping, and resulted in false negatives (bubbles not popped).
  33. 2010/10/05 -- (gatekeeper) - Change the text of the message when adding a library that already exists from an 'error' to an 'alert'.
  34. 2010/10/06 -- (CGW) - Resolve rare infinite loop during scaffold merging. CGW was merging two small scaffolds with 2 and 1 contigs, then ejecting the 1 contig to a new scaffold. It failed to notice that no merge actually occurred, and repeated the process.
  35. 2010/10/08 -- (BOG) - Fix a rare assert when we fail to find the correct split point. Now we'll just split at the slightly incorrect split point.
  36. 2010/10/09 -- (fragment correction) - Potentially fix crash when there are more than 2 billion overlaps in a single partition.
  37. 2010/10/09 -- (BOG) - Add utility for analyzing best.edges and best.contains and report the number of spurs etc per library.
  38. 2010/10/12 -- (BOG) - Check for mates to deleted fragments, and break the mate. Gatekeeper is NOT updated, only the mate in BOG is broken.
  39. 2010/10/12 -- (BOG) - Also output 'best.singletons'.
  40. 2010/10/25 -- (gatekeeper) - If an Illumina read is longer than AS_READ_MAX_PACKED_LEN, promote (demote?) it to a 'normal' read. Do not truncate it. Log a warning ("Alert") which will result in a very large log file if all reads are larger than the max.
  41. 2010/10/26 -- (gatekeeper) - Fix reader bug where reading a line with 1022 characters plus an endline caused us to miss the end of the line.
  42. 2010/10/27 -- (BOG) - Report computed insert sizes to logFile.
  43. 2010/11/02 -- (merTrim) - Use generic kMer instead of specific kMerTiny. Should allow merTrim on large kmers, but not supported by runCA.
  44. 2010/11/05 -- (buildPosMap) - Fix crash on assemblies with more than 32 million contigs.
  45. 2010/11/09 -- (BOG) - When merging unitigs, assert(0) if it fails to merge. This does occur, but the fix is too expensive given that bubble popping has already been rewritten (but not tested enough to be in CVS).
  46. 2010/11/16 -- (gatekeeper) - Load the TNT / contamination clear range from FRG files. This is used during chimera trimming to remove suspected linker. Since sffToCA was split out of gatekeeper (Oct 23 2008) chimera has been unable to do this trimming.
  47. 2010/11/17 -- (sffToCA) - Be more permissive when creating mate pairs. Replace existing heuristics (absolute values of errors, length) with percent identity and percent coverage thresholds for deciding when a linker alignment is good.
  48. 2010/11/19 -- (gatekeeper) - Fix another crash when loading fragments of a specific length.
  49. 2010/11/26 -- (runCA) - Add option ovlHashBlockLength, to set the block size based on fragment length instead of number of fragments. Disable a diagnostic output in meryl.pl.
  50. 2010/11/29 -- (overlapStore) - Filter OBT overlaps more aggressively. This is currently limited to the duplicate overlaps placed in the dupStore.
  51. 2010/12/02 -- (fragment correction) - When loading fragments, print out how many fragments are loaded. This is useful for choosing a batch size that fits in some specific memory size.
  52. 2010/12/08 -- (CGW) - Add new metagenomic options to CGW to ignore 'buried' links when merging scaffolds and to shatter scaffolds on initial load from checkpoint.
  53. 2010/12/08 -- (toggle) - Allow toggling by surrogate size only, ignoring number of instances.
  54. 2010/12/13 -- (toggle) - Resolve failure with cleanup=aggressive and doToggle=1. Cleanup was occurring before toggling, which removed the data stores, and toggling failed.
  55. 2010/12/13 -- (gatekeeper) - Fix another crash when loading fragments of a specific length (persistent little bugger).
  56. 2011/01/03 -- (overlapStore) - Fix integer overflow with large memory sizes when counting the number of overlaps in a batch. Add -F option to set the batch size based on an explicit limit on the number of batch files, instead of guessing a memory size.
  57. 2011/01/03 -- (utgcns,ctgcns) - Strongly type some integer indexes to prevent overflow (and then crashing) when working on VERY deep unitigs.
  58. 2011/01/03 -- (gatekeeper) - Add library feature to request mer based trimming.
  59. 2011/01/03 -- (fastqToCA) - Set doMerBasedTrimming flag.
  60. 2011/01/03 -- (gatekeeper) - Remove -E option (error log location). It is now always written to (gkpStore).errorLog.
  61. 2011/01/03 -- (gatekeeper) - Write a mapping from CA UID to Illumina name to (gkpStore).illuminaUIDmap.
  62. 2011/01/04 -- (sffToCA) - Don't assert for reads that are too short to be dedup'd, just silently skip them. Usually never occurs, until one mucks with AS_READ_MIN_LEN.
  63. 2011/01/04 -- (gatekeeper) - Convert invalid Illumina bases (like '.') into 'N' with minimum QV, and do this before trimming off low-qv ends. The previous behavior was to discard such reads.
  64. 2011/01/06 -- (gatekeeper) - Increase to version 7. Split the sequence data and the meta data in the PACKED read type, to allow loading of all meta data into memory. Increase the maximum length allowed from 104 to 136 bases.
  65. 2011/01/25 -- (dedup) - Remove stray paren in logging of 'DUPof' events.
  66. 2011/01/25 -- (sffToCA) - Slight performance increase by caching fragment metadata.
  67. 2011/01/25 -- (runCA) - Merge individual pieces into a single executable. Line numbers in the bin/ version are now the same as in the CVS/ version.
  68. 2011/01/25 -- (CGW,eCR) - Fix crash on invalid pointer (caused by RecomputeOffsetsInScaffold() occasionally reallocating the list of nodes in the graph).
  69. 2011/01/26 -- (general) - Add runtime support for setting the minimum fragment length and minimum overlap size.
  70. 2011/02/11 -- (gatekeeper) - Bug in handling of -outtie and -type. Type was used instead of outtie, resulting in reads of type SOLEXA being reverse complemented (e.g., treated as innie) and -outtie completely ignored in all cases.
  71. 2011/02/11 -- (sffToCA) - Multiple -linker with a custom sequence were using only the last -linker sequence; the search for linker would stop when any linker match was found, not necessarily the best match. This wasn't a problem with the standard linkers, but showed up with a custom linker where the forward and reverse versions were similar -- we'd find a partial match to the other strand and stop looking.
  72. 2011/02/11 -- (inputs) - Remove obsolete convert-solexa-to-v2.pl (use fastqToCA instead).
  73. 2011/02/11 -- (gatekeeper) - Synchronize the code for dumping fasta, newbler and fastq to make them consistent.
  74. 2011/02/11 -- (gatekeeper) - Fix QV problems in fastq output.
  75. 2011/02/21 -- (fastqToCA) - Check that the supplied fastq files actually exist, and ensure that the absolute path is used in the frg file.
  76. 2011/02/22 -- (fastqSimulate) - Add a basic paired-end and mate-pair simulator for Illumina.
  77. 2011/02/24 -- (gatekeeper) - Don't fail on invalid QV's, replace them with the closest valid value. Likewise, replace invalid bases with 'N' and a low QV.
  78. 2011/02/24 -- (gatekeeper) - Add a utility to upgrade a v6 gkpStore to v7 (2011/01/06).
  79. 2011/02/24 -- (tigStore) - Add a script to delete a unitig from the tigStore, updating gatekeeper to reflect the loss of fragments/mates.
  80. 2011/03/08 -- (gatekeeper) - Don't edit the clear range if the one we are supplied is invalid.
  81. 2011/03/08 -- (toggle) - Fix crash when running toggling on assemblies with packed gatekeeper fragments (usually Illumina).
  82. 2011/03/15 -- (mercy) - Use the real mode, not hardcoded 8. Still needs work to remove a hack to not pick the high-count low-threshold from error mers.
  83. 2011/03/16 -- (runCA) - Fix 'Undefined subroutine &main::caError' error.
  84. 2011/03/17 -- (CGW) - Allow missing (deleted) unitigs in the tigStore.
  85. 2011/03/31 -- (overlapStore) - Fix a crash when the memory size is too small or the number of buckets is too large and we run out of open files.
  86. 2011/03/31 -- (gatekeeper) - Fix crash when trimming junk from Illumina reads.
  87. 2011/03/31 -- (overlapStore) - Large read bug fixes. Up to 16,384bp reads have been tested.
  88. 2011/04/02 -- (overlapStore) - Fix an error when finding overlaps to dump.
  89. 2011/04/04 -- (general) - Increase the number of elements stored in the heap from 2 billion to however many billions we get from a 64-bit integer.
  90. 2011/04/05 -- (fastqToCA) - Add -interleaved to fastqToCA, and support for interleaved fastq files to gatekeeper.
  91. 2011/04/08 -- (gatekeeper) - Correct errors in dumpFastA: -allbases was not printing all bases; the mate UID was not being initialized correctly, resulting in the previous mate UID reported for fragments with no mate.
  92. 2011/04/22 -- (fastqSimulate) - Don't attempt to make a mate pair if the fragment is too small.
  93. 2011/04/28 -- (gatekeeper) - Fix crash when dumping fasta, newbler or fastq for fragments in library 0.
  94. 2011/05/02 -- (terminator) - Make checkpoint optional; if no checkpoint is provided, only UTG records are output.
  95. 2011/05/23 -- (gatekeeper) - Clarify diagnostic output about the QV encoding. Also report the IID in the UID to Name map. Fix error when loading interleaved reads.
  96. 2011/05/23 -- (gatekeeper) - Reset the line number on each new file loaded in gatekeeper.
  97. 2011/06/02 -- (overlapStore) - Dumping overlaps with -b and -e both equal to the last fragment in the store did not report all overlaps.
  98. 2011/06/03 -- (OBT) - Consolidate the various 'final overlap based trimming' algorithms into one program (finalTrim). To do this, the library features for trimming needed to be changed. This invalidates existing frag files -- but gatekeeper is smart enough to promote the old names to new names. One exception: the old fastqToCA did NOT enable dedup of illumina, and this cannot be promoted.
  99. 2011/06/06 -- (runCA) - Add 'mbtThreads' option to set number of threads in mer trimming.
  100. 2011/06/13 -- (runCA) - Replace the perl implementation of 'overlap configuration' with a much much faster C implementation.
  101. 2011/06/14 -- (gatekeeper) - Change the default 'frg' dump format from legacy version 1 to latest version 2. Remove '-format2' option, replace with '-legacyformat'.
  102. 2011/06/23 -- (merTrim) - Fix end effects in computing mer coverage. Remove gatekeeper update support.
  103. 2011/06/23 -- (OBT) - Glaring error in finding the largest covered region - always defaulted to the original trim points (introduced 2011/06/03).
  104. 2011/06/24 -- (gatekeeper) - Add 'clearRangeHistogram', a utility for plotting a histogram of the begin/end clear range.
  105. 2011/06/25 -- (gatekeeper) - Change the dumpinfo format to include the libIID, and first and last read IID.
  106. 2011/07/05 -- (fastqSimulate) - Add read type '-se' for single ended reads.
  107. 2011/07/07 -- (overlapConfig) - One too many fragments in hash block size resulting in an extra pass through the ref fragments.
  108. 2011/07/08 -- (merTrim) - Allow corrections when there are conflicting choices if exactly one choice is perfect.
  109. 2011/07/11 -- (gatekeeper) - Allow reading fragments from stdin.
  110. 2011/07/11 -- (gatekeeper) - Parse clear ranges from the fastq ID line. Correct logging of single-ended reads. Disable warning about reads longer than packed_length.
  111. 2011/07/19 -- (runCA) - Fix loading of frg files with '=' in the name from the spec file.
  112. 2011/07/20 -- (OBT) - Remove UID references from log files.
  113. 2011/07/21 -- (merTrim) - Detect chimeric fragments.
  114. 2011/07/25 -- (gatekeeper) - Increase from version 7 to version 8. Add a 'name' for each library in the store.
  115. 2011/07/25 -- (gatekeeper) - Add a utility to upgrade a v7 gkpStore to v8.
  116. 2011/07/26 -- (general) - Fixes to allow more than 1 billion reads.
  117. 2011/07/29 -- (BOG) - Fix crash when a unitig is larger than 1 Mbp.
  118. 2011/07/29 -- (gatekeeper) - Fail early if the fastq file doesn't exist.
  119. 2011/07/30 -- (overlap) - New 'ovm' overlapper, refactored from original overlapper.
  120. 2011/08/02 -- (overlap) - Make the HUIGE_TABLE_VERSION the default. Deprecate ovlMemory in favor of ovlHashBits and ovlHashBlockLength (to return when tuned). Replace ovlHashBlockSize with ovlHashBlockLength. Add overlapper ovm (AS_OVM) for testing.
  121. 2011/08/03 -- (fastqSimulate) - Don't allow reads or fragments to span or touch gaps in the reference sequence. Position the junction location uniformly instead of gaussian.
  122. 2011/08/03 -- (overlap) - Allow building of LARGE hash tables.
  123. 2011/08/08 -- (unitigger) - When loading bubble overlaps, warn and ignore suspicious overlaps. Previous behavior was to assert.
  124. 2011/08/15 -- (fastqSimulate) - Allow creation of a pure MP read set, no junction reads and no PE reads.
  125. 2011/08/19 -- (sffToCA) - Fix a sign flip resulting in an incorrect 'right fragment' length being computed. Impact: when a fragment has a begin clear range, the length of the right fragment was computed too small, and might result in the loss of the fragment.
  126. 2011/08/22 -- (runCA) - Stop reporting runCA options to stderr, instead, write them to runCA-logs/ in the asm directory.
  127. 2011/08/22 -- (general) - Improve error when gkpStore can't be opened - if store is read only, report "read-only" instead of "not found".
  128. 2011/08/22 -- (gatekeeper/OBT) - For the TNT / TAINT clear range, do not initialize to the current clear range. Initialize to the invalid clear range. Sanger reads do not use TNT. 454 reads use TNT, but were (correctly) forced to initialize to invalid. Illumina reads set TNT during merTrimApply, which would (incorrectly) initialize to the current clear range.
  129. 2011/08/24 -- (terminator) - Support more than 1 billion reads.
  130. 2011/08/24 -- (runCA) - Change mechanics of computeInsertSize. Now, if the value is not set, it will compute the insert size if fewer than 1m fragments. Otherwise, the set value (0 or 1) is used.
  131. 2011/08/29 -- (gatekeeper) - Add script (frg-to-fastq.pl) to convert FRGv2 (v1 not supported) to fastq.
  132. 2011/08/30 -- (gatekeeper) - Add generic fastq support and remove Illumina specific support. This changes the LIB features, and older fastqToCA outputs will no longer work.
  133. 2011/09/03 -- (CGW) - Fix crash on invalid pointer.
  134. 2011/09/04 -- (general) - Fixes for supporting up to 4 billion reads.
  135. 2011/09/13 -- (runCA) - Fail if 'ovlMemory' is supplied, but suggest the correct options to use instead.
  136. 2011/10/27 -- (utgcns) - Fix (again) for unitigs larger than 1Mbp.
  137. 2011/11/11 -- (overlapper) - Instrument InitializeWorkArea() to report how much memory was allocated.
  138. 2011/11/17 -- (utgcns) - Use a simple, but slow, dynamic programming based alignment instead of the faster heuristics.
  139. 2011/11/29 -- (runCA) - Add a post-process to resolve unitig consensus errors. Defines tigStore version 3 as the fixed consensus.
  140. 2011/12/04 -- (CGW) - Fix support for deleted unitigs.
  141. 2011/12/04 -- (utgcns) - Fix bad alignments resulting in gaps near the start/end of a fragment.
  142. 2011/12/08 -- (tigStore) - Add options -w and -s for formatting the multialign print output.
  143. 2011/12/09 -- (tigStore) - Set the clear range to use when printing multialigns based on the object being printed.
  144. 2011/12/09 -- (utgcns) - Reimplement low level base calling, to fix a bug when deep columns were assigned QV=0.
  145. 2011/12/09 -- (runCA) - Propagate frgMinLen and ovlMinLen environment variables to sub scripts. These would fail when run on SGE.
  146. 2011/12/10 -- (utgcns) - Fix bug on deep columns where the base call defaulted to '-' instead of an actual base.
  147. 2011/12/12 -- (ctgcns) - Fix bug in detecting variation near the end of a unitig.
  148. 2011/12/13 -- (runCA) - Retire the original overlapper implementation, use the new one from 2011/07/30.
  149. 2011/12/15 -- (BOG) - Fix crash when bubble fails to pop. Introduced on 2010/11/09.
  150. 2011/12/16 -- (tigStore) - Add a mate-pair analysis.
  151. 2011/12/19 -- (CGW) - Move mate-based chimeric unitig detection and correction out of CGW into its own module. This defines tigStore versions 4 and 5.
  152. 2011/12/20 -- (runCA) - Add computeCoverageStat, program to recompute the coverage stat for all unitigs in the tigStore.
  153. 2011/12/21 -- (terminator) - Check that the consensus sequence length is as expected.
  154. 2011/12/22 -- (utgcns) - Change static array to vector to handle more than 2000x coverage.
  155. 2011/12/24 -- (classify) - Fix off-by-one in fragment iteration. This caused the last fragment in the assembly to not be modified (either deleted or have the mate removed).
  156. 2011/12/29 -- (CGW) - Change checkpoint data sizes by converting from 'float' to 'double'.
  157. 2011/12/30 -- (runCA) - Fix scripting error when computing unitig arrival rates - it was reading the wrong tigStore version.
  158. 2012/01/05 -- (toggle) - Read from the correct starting tigStore version (2011/12/19).
  159. 2012/01/05 -- (ctgcns) - Fix a crash when abutting unitigs. This resolves the crashes due to "Assertion 'apos < alen' failed".
  160. 2012/01/06 -- (convert-fasta-to-v2.pl) - Check for invalid vector clear ranges. SF feature request #3462442.
  161. 2012/01/09 -- (runCA) - The -options switch now shows the default values of options, not the current values. SF bug #3053080.
  162. 2012/01/15 -- (OBT) - Disable debugging output that was obscuring the chimera log output.
  163. 2012/01/15 -- (OBT) - Add a header line to the finalTrim log.
  164. 2012/01/15 -- (OBT) - Rename several log/report files to be more consistent. Logs are now in *.log and summaries of the computes are in *.summary.
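
Entries 63 and 77 above describe gatekeeper repairing bad Illumina input instead of rejecting it: invalid bases become 'N' with a low QV, and invalid QVs are replaced with the closest valid value. The sketch below illustrates that behavior in Python. It is not the gatekeeper code. The function name sanitize_read, the QV bounds (0 to 60), the Phred+33 encoding and the upper-casing of bases are all assumptions made for the example.

```python
# Illustrative sketch only. MIN_QV, MAX_QV and PHRED_OFFSET are assumed
# values chosen for the example, not constants from the gatekeeper source.
VALID_BASES = set("ACGTN")
MIN_QV = 0
MAX_QV = 60
PHRED_OFFSET = 33


def sanitize_read(seq, qual):
    """Return (sequence, list of QVs) with bad bases and QVs repaired."""
    if len(seq) != len(qual):
        raise ValueError("sequence and quality strings differ in length")

    bases = []
    qvs = []
    # Upper-casing the bases is an assumption of this sketch.
    for base, qchar in zip(seq.upper(), qual):
        qv = ord(qchar) - PHRED_OFFSET
        # An out-of-range QV is clamped to the closest valid value.
        qv = max(MIN_QV, min(MAX_QV, qv))
        if base not in VALID_BASES:
            # An invalid base such as '.' becomes 'N' with the minimum QV.
            base = "N"
            qv = MIN_QV
        bases.append(base)
        qvs.append(qv)
    return "".join(bases), qvs
```

For example, sanitize_read("AC.T", "II#~") keeps the read. The '.' is reported as 'N' with QV 0, and the out-of-range QV encoded by '~' is clamped to the assumed maximum.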

KMER CHANGES

(none posted)