Efficient representation of uncertainty in multiple sequence alignments using directed acyclic graphs

Joseph L. Herman, Ádám Novák, Rune Lyngsø, Adrienn Szabó, I. Miklós, Jotun Hein

Research output: Contribution to journal › Article

12 Citations (Scopus)

Abstract

Background: A standard procedure in many areas of bioinformatics is to use a single multiple sequence alignment (MSA) as the basis for various types of analysis. However, downstream results may be highly sensitive to the alignment used, and neglecting the uncertainty in the alignment can lead to significant bias in the resulting inference. In recent years, a number of approaches have been developed for probabilistic sampling of alignments, rather than simply generating a single optimum. However, this type of probabilistic information is currently not widely used in the context of downstream inference, since most existing algorithms are set up to make use of a single alignment.

Results: In this work we present a framework for representing a set of sampled alignments as a directed acyclic graph (DAG) whose nodes are alignment columns; each path through this DAG then represents a valid alignment. Since the probabilities of individual columns can be estimated from empirical frequencies, this approach enables sample-based estimation of posterior alignment probabilities. Moreover, due to conditional independencies between columns, the graph structure encodes a much larger set of alignments than the original set of sampled MSAs, such that the effective sample size is greatly increased.

Conclusions: The alignment DAG provides a natural way to represent a distribution in the space of MSAs, and allows for existing algorithms to be efficiently scaled up to operate on large sets of alignments. As an example, we show how this can be used to compute marginal probabilities for tree topologies, averaging over a very large number of MSAs. This framework can also be used to generate a statistically meaningful summary alignment; example applications show that this summary alignment is consistently more accurate than the majority of the alignment samples, leading to improvements in downstream tree inference.
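The idea in the Results paragraph can be illustrated with a small Python sketch. This is not the authors' implementation: the column encoding, the rule that edges join only columns seen side by side in some sample, and the names `columns_of`, `build_dag`, and `count_paths` are all assumptions made for illustration. It assumes every sample aligns the same sequences in the same order and contains no all-gap columns.

```python
from collections import Counter, defaultdict

START, END = "START", "END"  # hypothetical sentinel nodes

def columns_of(alignment):
    """Encode each column as a tuple with one entry per sequence:
    (number of residues seen before this column, residue present here?).
    Identical columns from different samples get identical encodings."""
    seen = [0] * len(alignment)
    cols = []
    for chars in zip(*alignment):
        cols.append(tuple((seen[i], c != "-") for i, c in enumerate(chars)))
        for i, c in enumerate(chars):
            if c != "-":
                seen[i] += 1
    return cols

def build_dag(samples):
    """Merge sampled alignments into one graph of distinct columns.
    Returns the edge map and the empirical frequency of each column."""
    counts = Counter()
    edges = defaultdict(set)
    for aln in samples:
        path = [START] + columns_of(aln) + [END]
        counts.update(path[1:-1])
        for a, b in zip(path, path[1:]):
            edges[a].add(b)  # assumption: edges only between observed neighbours
    freqs = {col: n / len(samples) for col, n in counts.items()}
    return edges, freqs

def count_paths(edges, node=START, memo=None):
    """Count START-to-END paths, i.e. the alignments the graph encodes."""
    memo = {} if memo is None else memo
    if node == END:
        return 1
    if node not in memo:
        memo[node] = sum(count_paths(edges, nxt, memo) for nxt in edges[node])
    return memo[node]

# Two sampled alignments of the toy sequences AAGTT and AGT.
samples = [
    ["AAGTT", "-AGT-"],
    ["AAGTT", "A-G-T"],
]
edges, freqs = build_dag(samples)
print(len(freqs), "distinct columns")
print(count_paths(edges), "alignments encoded")
```

In this toy case the two samples share only the G/G column, which therefore has empirical frequency 1.0 while the other eight columns have 0.5. Because either prefix can be combined with either suffix at the shared column, the graph encodes 4 alignments from 2 samples, which is the sense in which the DAG represents a larger set than the samples themselves.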

Original language: English
Article number: 108
Journal: BMC Bioinformatics
Volume: 16
Issue number: 1
DOIs: 10.1186/s12859-015-0516-1
Publication status: Published - Apr 1 2015

Fingerprint

Multiple Sequence Alignment
Sequence Alignment
Directed Acyclic Graph
Uncertainty
Alignment
Computational Biology
Sample Size
Large Set
Effective Sample Size
Trees (mathematics)
Bioinformatics
Averaging

Keywords

  • Alignment graphs
  • Alignment uncertainty
  • Multiple sequence alignment
  • Statistical alignment

ASJC Scopus subject areas

  • Applied Mathematics
  • Structural Biology
  • Biochemistry
  • Molecular Biology
  • Computer Science Applications

Cite this

Efficient representation of uncertainty in multiple sequence alignments using directed acyclic graphs. / Herman, Joseph L.; Novák, Ádám; Lyngsø, Rune; Szabó, Adrienn; Miklós, I.; Hein, Jotun.

In: BMC Bioinformatics, Vol. 16, No. 1, 108, 01.04.2015.

@article{cea94b4f5bc5458ba953d0c948c4359d,
title = "Efficient representation of uncertainty in multiple sequence alignments using directed acyclic graphs",
keywords = "Alignment graphs, Alignment uncertainty, Multiple sequence alignment, Statistical alignment",
author = "Herman, {Joseph L.} and {\'A}d{\'a}m Nov{\'a}k and Rune Lyngs{\o} and Adrienn Szab{\'o} and I. Mikl{\'o}s and Jotun Hein",
year = "2015",
month = "4",
day = "1",
doi = "10.1186/s12859-015-0516-1",
language = "English",
volume = "16",
journal = "BMC Bioinformatics",
issn = "1471-2105",
publisher = "BioMed Central",
number = "1",

}

TY - JOUR

T1 - Efficient representation of uncertainty in multiple sequence alignments using directed acyclic graphs

AU - Herman, Joseph L.

AU - Novák, Ádám

AU - Lyngsø, Rune

AU - Szabó, Adrienn

AU - Miklós, I.

AU - Hein, Jotun

PY - 2015/4/1

Y1 - 2015/4/1

N2 - Background: A standard procedure in many areas of bioinformatics is to use a single multiple sequence alignment (MSA) as the basis for various types of analysis. However, downstream results may be highly sensitive to the alignment used, and neglecting the uncertainty in the alignment can lead to significant bias in the resulting inference. In recent years, a number of approaches have been developed for probabilistic sampling of alignments, rather than simply generating a single optimum. However, this type of probabilistic information is currently not widely used in the context of downstream inference, since most existing algorithms are set up to make use of a single alignment. Results: In this work we present a framework for representing a set of sampled alignments as a directed acyclic graph (DAG) whose nodes are alignment columns; each path through this DAG then represents a valid alignment. Since the probabilities of individual columns can be estimated from empirical frequencies, this approach enables sample-based estimation of posterior alignment probabilities. Moreover, due to conditional independencies between columns, the graph structure encodes a much larger set of alignments than the original set of sampled MSAs, such that the effective sample size is greatly increased. Conclusions: The alignment DAG provides a natural way to represent a distribution in the space of MSAs, and allows for existing algorithms to be efficiently scaled up to operate on large sets of alignments. As an example, we show how this can be used to compute marginal probabilities for tree topologies, averaging over a very large number of MSAs. This framework can also be used to generate a statistically meaningful summary alignment; example applications show that this summary alignment is consistently more accurate than the majority of the alignment samples, leading to improvements in downstream tree inference.

KW - Alignment graphs

KW - Alignment uncertainty

KW - Multiple sequence alignment

KW - Statistical alignment

UR - http://www.scopus.com/inward/record.url?scp=84927640123&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84927640123&partnerID=8YFLogxK

U2 - 10.1186/s12859-015-0516-1

DO - 10.1186/s12859-015-0516-1

M3 - Article

C2 - 25888064

AN - SCOPUS:84927640123

VL - 16

JO - BMC Bioinformatics

JF - BMC Bioinformatics

SN - 1471-2105

IS - 1

M1 - 108

ER -