Identification and correction of erroneous protein sequences in public databases

Research output: Chapter in Book/Report/Conference proceeding › Chapter


Correct prediction of the structure of protein-coding genes of higher eukaryotes is a difficult task; therefore, public sequence databases incorporating predicted sequences are increasingly contaminated with erroneous sequences. The high rate of misprediction has serious consequences, since it significantly affects the conclusions that may be drawn from genome-scale sequence analyses. Here we describe the MisPred and FixPred approaches, which may help in the identification and correction of erroneous sequences. The rationale of these approaches is that a protein sequence is likely to be erroneous if some of its features conflict with our current knowledge about proteins.
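The rationale stated in the abstract (flag a sequence when one of its features conflicts with what is known about proteins) can be illustrated with a minimal sketch. The three rules below are toy assumptions chosen for illustration only; they are not the checks implemented by MisPred or FixPred, and the names `Conflict` and `find_conflicts` are hypothetical.

```python
from dataclasses import dataclass

# The 20 standard amino acids in one-letter code.
STANDARD_RESIDUES = frozenset("ACDEFGHIKLMNPQRSTVWY")


@dataclass(frozen=True)
class Conflict:
    rule: str    # short identifier of the violated expectation
    detail: str  # human-readable explanation


def find_conflicts(sequence: str) -> list:
    """Return the toy rules that a protein sequence violates.

    Illustrative assumptions only, not the published MisPred rules.
    """
    seq = "".join(sequence.split()).upper()
    # A single trailing '*' is commonly used to mark the stop codon; ignore it.
    body = seq[:-1] if seq.endswith("*") else seq
    conflicts = []

    # Rule 1: a complete translated protein is expected to start with Met.
    if not body.startswith("M"):
        conflicts.append(Conflict("missing_start_methionine",
                                  "sequence does not begin with M"))

    # Rule 2: a stop symbol inside the sequence suggests a bad gene model.
    if "*" in body:
        position = body.index("*") + 1
        conflicts.append(Conflict("internal_stop",
                                  f"stop symbol at position {position}"))

    # Rule 3: symbols outside the standard alphabet (for example X).
    unknown = sorted(set(body) - STANDARD_RESIDUES - {"*"})
    if unknown:
        conflicts.append(Conflict("nonstandard_residue",
                                  "unexpected symbols: " + "".join(unknown)))

    return conflicts


if __name__ == "__main__":
    for conflict in find_conflicts("KTAY*IAKXR"):
        print(conflict.rule, "-", conflict.detail)
```

A sequence that triggers no rule is not thereby shown to be correct; a check of this kind can only flag candidates for closer inspection.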

Original language: English
Title of host publication: Methods in Molecular Biology
Publisher: Humana Press Inc.
Number of pages: 14
Publication status: Published - Aug 1 2016

Publication series

Name: Methods in Molecular Biology
ISSN (Print): 1064-3745



Keywords

  • Gene prediction
  • Genome annotation
  • Genome assembly
  • Misannotation
  • Misassembly
  • Misprediction
  • Protein-coding genes
  • Proteins
  • Sequencing errors

ASJC Scopus subject areas

  • Medicine(all)
  • Molecular Biology
  • Genetics

Cite this

Patthy, L. (2016). Identification and correction of erroneous protein sequences in public databases. In Methods in Molecular Biology (Vol. 1415, pp. 179-192). (Methods in Molecular Biology; Vol. 1415). Humana Press Inc.