Iterative error correction of long sequencing reads maximizes accuracy and improves contig assembly

Katrin Sameith; Juliana G Roscito; Michael Hiller

doi:10.1093/bib/bbw003

Publication: Contribution to journal › Research article › Contributed › Peer-reviewed

Contributors

Katrin Sameith - , Max-Planck-Institut für molekulare Zellbiologie und Genetik (Author)
Juliana G Roscito - , Max-Planck-Institut für molekulare Zellbiologie und Genetik (Author)
Michael Hiller - (Author)

Abstract

Next-generation sequencers such as Illumina can now produce reads up to 300 bp with high throughput, which is attractive for genome assembly. A first step in genome assembly is to computationally correct sequencing errors. However, correcting all errors in these longer reads is challenging. Here, we show that reads with remaining errors after correction often overlap repeats, where short erroneous k-mers occur in other copies of the repeat. We developed an iterative error correction pipeline that runs the previously published String Graph Assembler (SGA) in multiple rounds of k-mer-based correction with an increasing k-mer size, followed by a final round of overlap-based correction. By combining the advantages of small and large k-mers, this approach corrects more errors in repeats and minimizes the total amount of erroneous reads. We show that higher read accuracy increases contig lengths two to three times. We provide SGA-Iteratively Correcting Errors (https://github.com/hillerlab/IterativeErrorCorrection/) that implements iterative error correction by using modules from SGA.
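
The repeat effect and the idea of correcting in rounds of increasing k-mer size can be illustrated with a small toy example. This is only a sketch under stated assumptions, not the SGA implementation or the authors' pipeline: it works in pure Python, considers the forward strand only, tries single-base substitutions, and omits the final overlap-based round. All function names, the example reads, and the `min_count` threshold are hypothetical and chosen for illustration.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (forward strand only) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def num_weak_kmers(read, counts, k, min_count):
    """Number of k-mers in `read` seen fewer than `min_count` times."""
    return sum(
        1
        for i in range(len(read) - k + 1)
        if counts[read[i:i + k]] < min_count
    )


def correct_read(read, counts, k, min_count):
    """Return the first single-base substitution that leaves no weak k-mer."""
    if num_weak_kmers(read, counts, k, min_count) == 0:
        return read
    for pos in range(len(read)):
        for base in "ACGT":
            if base == read[pos]:
                continue
            candidate = read[:pos] + base + read[pos + 1:]
            if num_weak_kmers(candidate, counts, k, min_count) == 0:
                return candidate
    return read  # no single substitution helps: leave the read unchanged


def iterative_correction(reads, ks, min_count=2):
    """Run one k-mer-based correction round per k, in the order given."""
    for k in ks:
        counts = count_kmers(reads, k)  # recount after every round
        reads = [correct_read(r, counts, k, min_count) for r in reads]
    return reads


# Hypothetical data: two repeat copies that differ at one base (index 7)
# and have different flanking sequence.
copy1 = "AACTGCAAGTCAGGT"
copy2 = "CTTTGCACGTCATAG"
# A read from copy 1 whose error (A -> C at index 7) mimics copy 2.
noisy = "AACTGCACGTCAGGT"
reads = [copy1] * 3 + [copy2] * 3 + [noisy]

for k in (4, 10):
    weak = num_weak_kmers(noisy, count_kmers(reads, k), k, min_count=2)
    print(k, weak)  # prints "4 0", then "10 6"

corrected = iterative_correction(reads, ks=(4, 10))
print(corrected[-1] == copy1)  # prints "True"
```

With k = 4, every k-mer covering the error also occurs in reads from the other repeat copy, so the erroneous read looks trusted and is left untouched. With k = 10, each k-mer covering the error also reaches into the unique flanking sequence, so the error becomes visible and is corrected in the second round.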

Details

Original language	English
Pages (from - to)	1-8
Number of pages	8
Journal	Briefings in bioinformatics
Volume	18
Issue number	1
Publication status	Published - Jan. 2017
Peer-review status	Yes
Externally published	Yes

External IDs

PubMedCentral	PMC5221426
Scopus	85015830340
ORCID	/0000-0003-4306-930X/work/141545247
ORCID	/0000-0003-1494-1162/work/142255068

Keywords

Algorithms, High-Throughput Nucleotide Sequencing, Sequence Analysis, DNA
