Correction of third generation sequencing errors

Authors: Patryk Pankiewicz, Wiktor Kuśmirek, Robert Nowak

Next-generation sequencing is the main source of DNA data. One problem in DNA analysis is finding similarities between different species. Algorithms dedicated to this purpose are based on dynamic programming (Needleman-Wunsch, Smith-Waterman); they give a good measure of similarity but are not efficient for large data sets.
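To illustrate why dynamic programming scales poorly, the following is a minimal sketch of the Needleman-Wunsch global alignment score. It is not code from dnaasm; the function name and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration. The nested loops fill one cell per pair of positions, so the running time grows with the product of the two sequence lengths.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Global alignment score (Needleman-Wunsch), keeping only two rows of the
// dynamic programming table. The scoring values are illustrative assumptions.
int needlemanWunschScore(const std::string& a, const std::string& b,
                         int match = 1, int mismatch = -1, int gap = -1) {
    assert(gap <= 0);  // this sketch assumes gaps are penalised, not rewarded

    std::vector<int> prev(b.size() + 1), curr(b.size() + 1);

    // First row: aligning an empty prefix of a against j symbols of b.
    for (std::size_t j = 0; j <= b.size(); ++j)
        prev[j] = static_cast<int>(j) * gap;

    for (std::size_t i = 1; i <= a.size(); ++i) {
        // First column: aligning i symbols of a against an empty prefix of b.
        curr[0] = static_cast<int>(i) * gap;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            int diag = prev[j - 1] + (a[i - 1] == b[j - 1] ? match : mismatch);
            int up   = prev[j] + gap;      // gap in b
            int left = curr[j - 1] + gap;  // gap in a
            curr[j] = std::max({diag, up, left});
        }
        std::swap(prev, curr);
    }
    return prev[b.size()];
}
```

Under these assumed scores, two identical four-base sequences score 4, and dropping one base from one of them lowers the score to 2 (three matches and one gap). Every pair of sequences requires a full table, which is what makes all-against-all comparison of large read sets expensive.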

In this study we present a new algorithm based on common parts of reads. The approach can handle all types of sequencing errors (insertions, deletions and substitutions), and its results are similar to those of other well-known applications.
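The abstract does not specify how common parts of reads are found. One common way to make the idea concrete, assumed here purely for illustration, is to compare reads by the fixed-length substrings (k-mers) they share. The sketch below is not the dnaasm implementation; the function name and the choice of k are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>

// Counts the distinct k-mers that occur in both reads.
// Illustrative only: the study's actual notion of "common parts" may differ.
std::size_t sharedKmerCount(const std::string& readA,
                            const std::string& readB,
                            std::size_t k) {
    assert(k > 0);
    if (readA.size() < k || readB.size() < k) return 0;

    // Collect every k-mer of the first read.
    std::unordered_set<std::string> kmersA;
    for (std::size_t i = 0; i + k <= readA.size(); ++i)
        kmersA.insert(readA.substr(i, k));

    // Keep the k-mers of the second read that also occur in the first.
    std::unordered_set<std::string> shared;
    for (std::size_t j = 0; j + k <= readB.size(); ++j) {
        std::string kmer = readB.substr(j, k);
        if (kmersA.count(kmer) != 0) shared.insert(kmer);
    }
    return shared.size();
}
```

A single substitution changes at most k of a read's k-mers, so two reads from the same region can still share k-mers away from the error. For example, sharedKmerCount("ACGTACGT", "ACGTTCGT", 4) finds one shared k-mer, ACGT, even though the reads differ at one position.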

The presented algorithm is implemented in C++ with the Boost library and uses threads for parallel computing. It is part of the dnaasm application; the source code and a demo application are available at the project homepage: http://dnaasm.sourceforge.net.

Author: Patryk Pankiewicz