GMcloser: closing gaps in assemblies accurately with a likelihood-based selection of contig or long-read alignments

Bioinformatics. 2015 Dec 1;31(23):3733-41. doi: 10.1093/bioinformatics/btv465. Epub 2015 Aug 10.

Abstract

Motivation: Genome assemblies generated with next-generation sequencing (NGS) reads usually contain a number of gaps. Several tools have recently been developed to close the gaps in these assemblies with NGS reads. Although these gap-closing tools efficiently close the gaps, they entail a high rate of misassembly at gap-closing sites.

Results: We have found that the assembly error rates caused by these tools are 20-500-fold higher than the rate of errors introduced into contigs by de novo assemblers. We here describe GMcloser, a tool that accurately closes these gaps with a preassembled contig set or a long read set (i.e., error-corrected PacBio reads). GMcloser uses likelihood-based classifiers calculated from the alignment statistics between scaffolds, contigs and paired-end reads to correctly assign contigs or long reads to gap regions of scaffolds, thereby achieving accurate and efficient gap closure. We demonstrate with sequencing data from various organisms that the gap-closing accuracy of GMcloser is 3-100-fold higher than those of other available tools, with similar efficiency.

Availability and implementation: GMcloser and an accompanying tool (GMvalue) for evaluating the assembly and correcting misassemblies except SNPs and short indels in the assembly are available at https://sourceforge.net/projects/gmcloser/.

Contact: shunichi.kosugi@riken.jp.

Supplementary information: Supplementary data are available at Bioinformatics online.

Publication types

  • Research Support, Non-U.S. Gov't

MeSH terms

  • Contig Mapping / methods*
  • High-Throughput Nucleotide Sequencing / methods*
  • Likelihood Functions
  • Sequence Alignment*
  • Sequence Analysis, DNA / methods*
  • Software*