Implementation of a hybrid algorithm for generating scaffolds

The second generation sequencing methods produce high-quality short reads, which are assembled into contigs
by DNA assemblers. Due to the fact that length of a single read is limited to 500bp, the contigs are far shorter
than chromosomes. Generating longer contigs is a main effort of computer scientists in this area.

The most popular approach for contig extension is to use pair-end tags, however this method does not always
produce best possible results because distances between two paired reads are limited to 8kbp. An another,
recently published approach appears to be free of this disadvantage. Contig linking is performed using reads of
lengths reaching 20kbp. The above-mentioned reads can be obtained as a result of third-generation sequencing.
An existing implementation of this algorithm appears to be time and memory demanding for larger genomes.

We propose a new optimization of this algorithm based on Bloom filter and extremely memory-efficient
associative array. Our improved implementation remarkably exceeds the previous one in terms of time and
memory consumption. The updated algorithm has been tested on real data of bacteria, yeast and plant genomes.

The optimized algorithm, provided as a library, is a part of the dnaasm de novo assembler. The library has
been created using C++ programming language, Boost and Google Sparse Hash libraries. Both web browser-
based graphical user interface and command line interface are provided. Source code as well as a demo web
application and a docker image are available at the dnaasm project web-page:

http://dnaasm.sourceforge.net.

Author: Wiktor Franus
Conference: Title