VariantMetaCaller: Automated fusion of variant calling pipelines for quantitative, precision-based filtering

András Gézsi, Bence Bolgár, Péter Marx, Peter Sarkozy, C. Szalai, Péter Antal

Research output: Contribution to journal (Article)

11 Citations (Scopus)

Abstract

Background: The low concordance between different variant calling methods still poses a challenge for the widespread application of next-generation sequencing in research and clinical practice. A wide range of variant annotations can be used to filter call sets and thereby improve the precision of the variant calls, but choosing appropriate filtering thresholds is not straightforward. Variant quality score recalibration provides an alternative to hard filtering, but it requires large-scale genomic data.

Results: We evaluated germline variant calling pipelines based on the BWA and Bowtie 2 aligners in combination with the GATK UnifiedGenotyper, GATK HaplotypeCaller, FreeBayes and SAMtools variant callers, using simulated and real benchmark sequencing data (NA12878 with Illumina Platinum Genomes). We argue that these pipelines are not merely discordant, but that they extract complementary useful information. We introduce VariantMetaCaller to test the hypothesis that the automated fusion of measurement-related information allows better performance than the recommended hard-filtering settings or recalibration, and than the fusion of the individual call sets without using annotations. VariantMetaCaller uses Support Vector Machines to combine multiple information sources generated by variant calling pipelines and estimates the probabilities of variants. This novel method had significantly higher sensitivity and precision than the individual variant callers in all target region sizes, ranging from a few hundred kilobases to whole exomes. We also demonstrated that VariantMetaCaller supports a quantitative, precision-based filtering of variants under wider conditions. Specifically, the computed probabilities can be used to order the variants, and for a given threshold, the probabilities can be used to estimate precision. Precision can then be directly translated to the number of true called variants or, equivalently, to the number of false calls, which allows finding a problem-specific balance between sensitivity and precision.

Conclusions: VariantMetaCaller can be applied to small target regions as well as whole exomes, and it can be used for organisms for which highly accurate variant call sets are not yet available. It can therefore be a viable alternative to hard filtering in cases where variant quality score recalibration cannot be used. VariantMetaCaller is freely available at http://bioinformatics.mit.bme.hu/VariantMetaCaller.
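The precision-based filtering described in the Results can be illustrated with a small sketch. It is not the authors' implementation: it assumes only that each call carries a well-calibrated probability of being a true variant, in which case the expected number of true calls above a cutoff is the sum of the retained probabilities and the expected precision is their mean. The function names and the example probabilities are hypothetical.

```python
from typing import List, Optional, Tuple


def expected_precision(probs: List[float], threshold: float) -> Tuple[int, float, float]:
    """Return (calls kept, expected true calls, expected precision) for p >= threshold.

    Assumes the probabilities are calibrated, so their sum estimates the
    number of true variants among the retained calls.
    """
    kept = [p for p in probs if p >= threshold]
    if not kept:
        return 0, 0.0, float("nan")
    expected_true = sum(kept)
    return len(kept), expected_true, expected_true / len(kept)


def threshold_for_precision(probs: List[float], target: float) -> Optional[float]:
    """Return the lowest cutoff whose expected precision still meets the target.

    Calls are ranked by probability; a cutoff is only considered at the end
    of a group of tied probabilities, because filtering with p >= cutoff
    keeps every call in that group. Returns None if no cutoff qualifies.
    """
    ranked = sorted(probs, reverse=True)
    best = None
    running = 0.0
    for i, p in enumerate(ranked, start=1):
        running += p
        group_end = i == len(ranked) or ranked[i] < p
        if group_end and running / i >= target:
            best = p
    return best


if __name__ == "__main__":
    # Hypothetical per-variant probabilities, for illustration only.
    example = [0.99, 0.95, 0.90, 0.60, 0.30]
    n, true_calls, precision = expected_precision(example, 0.90)
    print(f"kept={n} expected_true={true_calls:.2f} "
          f"expected_false={n - true_calls:.2f} precision={precision:.3f}")
    print("cutoff for 85% expected precision:", threshold_for_precision(example, 0.85))
```

Lowering the cutoff keeps more calls (higher sensitivity) at the cost of a lower expected precision, which is the trade-off the abstract refers to as a problem-specific balance.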

Original language: English
Article number: 875
Journal: BMC Genomics
Volume: 16
Issue number: 1
DOI: 10.1186/s12864-015-2050-y
Publication status: Published - Oct 28 2015

Keywords

  • Next-generation sequencing
  • Support Vector Machine
  • Variant calling

ASJC Scopus subject areas

  • Biotechnology
  • Genetics

Cite this

VariantMetaCaller: Automated fusion of variant calling pipelines for quantitative, precision-based filtering. / Gézsi, András; Bolgár, Bence; Marx, Péter; Sarkozy, Peter; Szalai, C.; Antal, Péter.

In: BMC Genomics, Vol. 16, No. 1, 875, 28.10.2015.

Gézsi, András; Bolgár, Bence; Marx, Péter; Sarkozy, Peter; Szalai, C.; Antal, Péter. / VariantMetaCaller: Automated fusion of variant calling pipelines for quantitative, precision-based filtering. In: BMC Genomics. 2015; Vol. 16, No. 1.

@article{dbdf7c72a553413c851d6f8606f29957,
title = "VariantMetaCaller: Automated fusion of variant calling pipelines for quantitative, precision-based filtering",
keywords = "Next-generation sequencing, Support Vector Machine, Variant calling",
author = "Andr{\'a}s G{\'e}zsi and Bence Bolg{\'a}r and P{\'e}ter Marx and Peter Sarkozy and C. Szalai and P{\'e}ter Antal",
year = "2015",
month = "10",
day = "28",
doi = "10.1186/s12864-015-2050-y",
language = "English",
volume = "16",
journal = "BMC Genomics",
issn = "1471-2164",
publisher = "BioMed Central",
number = "1",

}

TY - JOUR

T1 - VariantMetaCaller

T2 - Automated fusion of variant calling pipelines for quantitative, precision-based filtering

AU - Gézsi, András

AU - Bolgár, Bence

AU - Marx, Péter

AU - Sarkozy, Peter

AU - Szalai, C.

AU - Antal, Péter

PY - 2015/10/28

Y1 - 2015/10/28

KW - Next-generation sequencing

KW - Support Vector Machine

KW - Variant calling

UR - http://www.scopus.com/inward/record.url?scp=84945534349&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84945534349&partnerID=8YFLogxK

U2 - 10.1186/s12864-015-2050-y

DO - 10.1186/s12864-015-2050-y

M3 - Article

C2 - 26510841

AN - SCOPUS:84945534349

VL - 16

JO - BMC Genomics

JF - BMC Genomics

SN - 1471-2164

IS - 1

M1 - 875

ER -