Robust and exact structural variation detection with paired-end and soft-clipped alignments: SoftSV compared with eight algorithms

Brief Bioinform. 2016 Jan;17(1):51-62. doi: 10.1093/bib/bbv028. Epub 2015 May 20.

Abstract

Structural variation (SV) plays an important role in genetic diversity within the general population and in diseases such as cancer. Modern next-generation sequencing (NGS) technologies provide paired-end sequencing data at high depth with increasing read lengths. This development has enabled the analysis of split reads to detect SV breakpoints at single-nucleotide resolution. However, ambiguous mappings and breakpoint sequences that carry additional co-occurring mutations hamper split-read alignment against a reference sequence. The trade-off between high sensitivity and a low false-positive rate is problematic and often requires extensive fine-tuning of the analysis method, based on knowledge of its algorithm and the characteristics of the data set. We present SoftSV, a method for exact breakpoint detection for small and large deletions, inversions, tandem duplications and inter-chromosomal translocations, which relies solely on the mutual alignment of soft-clipped reads within the neighborhood of discordantly mapped paired-end reads. Unlike other SV detection algorithms, our approach does not require thresholds on sequencing coverage or mapping quality. We evaluate SoftSV together with eight other approaches (Breakdancer, Clever, CREST, Delly, GASVPro, Pindel, Socrates and SoftSearch) on simulated and real data sets. Our results show that sensitive and reliable SV detection depends on many factors, such as read length, sequence coverage and SV type. While most programs have their individual drawbacks, our greedy approach turns out to be the most robust and sensitive in many experimental setups. Sensitivities above 85% and positive predictive values between 80 and 100% were achieved consistently for all SV types on simulated data sets, starting at relatively short 75 bp reads and low 10-15× sequence coverage.
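To illustrate what a soft-clipped alignment contributes to breakpoint detection, the following minimal Python sketch derives candidate breakpoint positions from a SAM-style alignment record. It is not SoftSV's implementation: the function name soft_clip_breakpoints and its return format are hypothetical, and the sketch only assumes the standard SAM conventions (1-based leftmost position, CIGAR operations M, D, N, = and X consume reference bases, S marks soft-clipped bases).

```python
import re

# One CIGAR element: a length followed by a SAM operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

# Operations that advance the position on the reference sequence.
REF_CONSUMING = set("MDN=X")


def soft_clip_breakpoints(pos, cigar):
    """Return candidate breakpoints implied by soft clips (hypothetical helper).

    pos   -- 1-based leftmost aligned reference position (SAM POS field)
    cigar -- SAM CIGAR string, e.g. "30S70M"

    Each result is a tuple (side, reference_position, clipped_length), where
    reference_position is the aligned base adjacent to the clipped segment.
    """
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Hard clips carry no sequence, so ignore them when looking at read ends.
    core = [(n, op) for n, op in ops if op != "H"]
    ref_len = sum(n for n, op in core if op in REF_CONSUMING)

    candidates = []
    if core and core[0][1] == "S":
        # Clip at the read start: the breakpoint lies at the first aligned base.
        candidates.append(("left", pos, core[0][0]))
    if len(core) > 1 and core[-1][1] == "S":
        # Clip at the read end: the breakpoint lies at the last aligned base.
        candidates.append(("right", pos + ref_len - 1, core[-1][0]))
    return candidates


if __name__ == "__main__":
    print(soft_clip_breakpoints(100, "30S70M"))
    print(soft_clip_breakpoints(100, "60M5D10M30S"))
```

In a complete workflow, candidates like these would still have to be restricted to the neighborhood of discordantly mapped read pairs and aligned against each other, as the abstract describes; the sketch covers only the first step of locating the clipped ends.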

Keywords: structural variation; paired-end sequencing; split-reads; simulation.

Publication types

  • Comparative Study
  • Research Support, Non-U.S. Gov't

MeSH terms

  • Algorithms*
  • Chromosome Breakpoints*
  • Computational Biology
  • Databases, Nucleic Acid / statistics & numerical data
  • Genetic Variation*
  • Genome, Human
  • High-Throughput Nucleotide Sequencing / statistics & numerical data
  • Humans
  • Mutation
  • Sequence Alignment / statistics & numerical data*
  • Sequence Analysis, DNA
  • Sequence Deletion
  • Software