Denoising DNA deep sequencing data: high-throughput sequencing errors and their correction

David Laehnemann; Arndt Borkhardt; Alice Carolyn McHardy

doi:10.1093/bib/bbv029

Brief Bioinform. 2016 Jan;17(1):154-79. doi: 10.1093/bib/bbv029. Epub 2015 May 29.

Abstract

Characterizing the errors generated by common high-throughput sequencing platforms and telling true genetic variation from technical artefacts are two interdependent steps, essential to many analyses such as single nucleotide variant calling, haplotype inference, sequence assembly and evolutionary studies. Both random and systematic errors can show a specific occurrence profile for each of the six prominent sequencing platforms surveyed here: 454 pyrosequencing, Complete Genomics DNA nanoball sequencing, Illumina sequencing by synthesis, Ion Torrent semiconductor sequencing, Pacific Biosciences single-molecule real-time sequencing and Oxford Nanopore sequencing. A large variety of programs is available for error removal in sequencing read data. These programs differ in the error models and statistical techniques they use, the features of the data they analyse, the parameters they determine from those features and the data structures and algorithms they use. We highlight the assumptions they make and the data types for which these assumptions hold, providing guidance on which tools to consider for benchmarking with regard to the data properties. While no benchmarking results are included here, such specific benchmarks would greatly inform tool choices and future software development. The development of stand-alone error correctors, as well as single nucleotide variant and haplotype callers, could also benefit from using more of the knowledge about error profiles and from (re)combining ideas from the existing approaches presented here.

Keywords: bias; error correction; error model; error profile; high-throughput sequencing; next-generation sequencing.

Publication types

Research Support, Non-U.S. Gov't

MeSH terms

Algorithms
Computational Biology / methods
Genome, Human
Genomics / statistics & numerical data
High-Throughput Nucleotide Sequencing / statistics & numerical data*
Humans
Polymorphism, Genetic
Sequence Alignment / statistics & numerical data
Sequence Analysis, DNA / statistics & numerical data*
Software*