SNooPer

A machine learning-based method for somatic variant identification from low-pass next-generation sequencing

///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
ABSTRACT

The advent of next-generation sequencing has allowed unbiased, in-depth interrogation of cancer genomes. Many somatic variant callers have been developed, yet accurate ascertainment of somatic variants remains a considerable challenge, as evidenced by the consistently weak overlap between algorithms. Currently available statistical model-based algorithms perform well under best-case scenarios (e.g. high sequencing depth, homogeneous tumor samples, high somatic variant allele frequency (VAF)) but show only limited performance with sub-optimal data such as low-pass whole-genome sequencing data. We propose SNooPer, a highly versatile machine learning approach that uses Random Forest classification models to accurately call somatic variants in low-depth sequencing data. SNooPer trains a data-specific model on a subset of variant positions from the sequencing output for which the class is known: either true variation or sequencing error. This implicitly requires that a subset of positions be validated on an independent NGS platform or using an orthogonal technology. During the training phase, multiple features, including measures of quality, coverage and strand bias, are extracted from the mpileup files. Features are ranked by information gain, and only informative features are used to build the classification model. In the original work, using a real dataset of 40 childhood acute lymphoblastic leukemia patients, we show that the SNooPer algorithm is not affected by limited sequence coverage or low VAFs, and that it can be used to reduce overall sequencing costs while maintaining high specificity and sensitivity in somatic variant calling, particularly in low-depth sequencing data.
While the goal of any cancer sequencing project is to identify a relevant, and limited, set of somatic variants for further sequence/functional validation, the inherently complex nature of cancer genomes, combined with technical issues directly related to sequencing and alignment, can affect the specificity, the sensitivity, or both, of most callers. The flexibility of SNooPer’s random forest protects against technical bias and systematic errors, and is appealing in that it does not rely on user-defined parameters. The SNooPer source code is freely available for academic users.
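
The feature-ranking step described in the abstract can be illustrated with a small sketch. This is not SNooPer's own code: the function names, the feature names and the four toy training positions below are all hypothetical, features are assumed to be already discretized into categories, and the Random Forest training that would follow is left out. It only shows how information gain can rank features against known classes (true variant vs. sequencing error) and drop those that carry no information.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(values, labels):
    """Label entropy minus the entropy left after splitting on a discrete feature."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[value].append(label)
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def rank_features(rows, labels, min_gain=0.0):
    """Return (feature, gain) pairs with gain > min_gain, highest gain first."""
    gains = {
        name: information_gain([row[name] for row in rows], labels)
        for name in rows[0]
    }
    ranked = sorted(gains.items(), key=lambda item: item[1], reverse=True)
    return [(name, gain) for name, gain in ranked if gain > min_gain]


# Hypothetical, made-up training positions with validated classes.
rows = [
    {"base_quality": "high", "strand_bias": "no", "coverage": "low"},
    {"base_quality": "high", "strand_bias": "no", "coverage": "high"},
    {"base_quality": "low", "strand_bias": "yes", "coverage": "low"},
    {"base_quality": "low", "strand_bias": "no", "coverage": "high"},
]
labels = ["variant", "variant", "error", "error"]

ranked = rank_features(rows, labels)
for name, gain in ranked:
    print(f"{name}: {gain:.3f}")
```

In this toy set, the hypothetical `coverage` feature splits the classes evenly, so its gain is zero and it is excluded; only the remaining features would be passed on to a classifier.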

///////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////