A detailed error analysis of 13 kernel methods for protein-protein interaction extraction

D. Tikk, Illés Solt, Philippe Thomas, Ulf Leser

Research output: Contribution to journal › Article

16 Citations (Scopus)

Abstract

Background: Kernel-based classification is the current state of the art for extracting pairs of interacting proteins (PPIs) from free text. Various proposals have been put forward, which differ mainly in the specific kernel function, the type of input representation, and the feature sets. These proposals are regularly compared with each other on their overall performance on different gold-standard corpora, but little is known about their respective performance at the instance level.

Results: We report on a detailed analysis of the shared characteristics and the differences between 13 current methods using five PPI corpora. We identified a large number of rather difficult (misclassified by most methods) and easy (correctly classified by most methods) PPIs. We show that kernels using the same input representation perform similarly on these pairs and that building ensembles from dissimilar kernels leads to a significant performance gain. However, our analysis also reveals that difficult pairs share few characteristics, which lowers the hope that new methods, if built along the same lines as current ones, will deliver breakthroughs in extraction performance.

Conclusions: Our experiments show that current methods do not seem to capture the shared characteristics of positive PPI pairs very well, which must also be attributed to the heterogeneity of the (still very few) available corpora. Our analysis suggests that performance improvements should be sought in novel feature sets rather than in novel kernel functions.
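The two instance-level notions used in the abstract, pairs that most methods get wrong or right and ensembles built from several kernels, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names, the 0.75 cut-off standing in for "most methods", the unweighted majority vote, and the toy labels are hypothetical and are not taken from the paper.

```python
def difficulty_classes(gold, predictions, threshold=0.75):
    """Label each candidate pair 'difficult', 'easy', or 'neutral'.

    gold        -- list of true labels (1 = interacting, 0 = not interacting)
    predictions -- dict: method name -> list of predicted labels
    threshold   -- hypothetical cut-off for "most methods" (an assumption)
    """
    n_methods = len(predictions)
    classes = []
    for i, truth in enumerate(gold):
        # Count how many methods misclassify pair i.
        wrong = sum(1 for preds in predictions.values() if preds[i] != truth)
        frac_wrong = wrong / n_methods
        if frac_wrong >= threshold:
            classes.append("difficult")
        elif frac_wrong <= 1 - threshold:
            classes.append("easy")
        else:
            classes.append("neutral")
    return classes


def majority_vote(predictions, members):
    """Unweighted majority vote over the chosen methods; ties count as 0."""
    n_pairs = len(predictions[members[0]])
    voted = []
    for i in range(n_pairs):
        positives = sum(predictions[m][i] for m in members)
        voted.append(1 if 2 * positives > len(members) else 0)
    return voted


# Toy data invented for illustration only; these are not results from the paper.
gold = [1, 0, 1, 1, 0, 1]
predictions = {
    "kernel_a": [1, 0, 0, 1, 0, 0],
    "kernel_b": [1, 0, 0, 0, 1, 0],
    "kernel_c": [1, 1, 1, 1, 0, 0],
    "kernel_d": [1, 0, 0, 1, 0, 0],
}

classes = difficulty_classes(gold, predictions)
ensemble = majority_vote(predictions, ["kernel_a", "kernel_b", "kernel_c"])
print(classes)
print(ensemble)
```

In this toy setting the vote corrects errors that only one member makes, but it cannot recover a pair that every member misclassifies, which mirrors the abstract's point that difficult pairs limit what ensembles of similar methods can achieve.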

Original language: English
Article number: 12
Journal: BMC Bioinformatics
Volume: 14
Issue number: 1
DOI: 10.1186/1471-2105-14-12
Publication status: Published - Jan 16 2013

Keywords

  • Error analysis
  • Kernel methods
  • Kernel similarity
  • Protein-protein interaction
  • Relation extraction

ASJC Scopus subject areas

  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
  • Applied Mathematics
  • Structural Biology

Cite this

A detailed error analysis of 13 kernel methods for protein-protein interaction extraction. / Tikk, D.; Solt, Illés; Thomas, Philippe; Leser, Ulf.

In: BMC Bioinformatics, Vol. 14, No. 1, 12, 16.01.2013.

@article{e8d00bb73942485db23a181919a4765b,
title = "A detailed error analysis of 13 kernel methods for protein-protein interaction extraction",
abstract = "Background: Kernel-based classification is the current state-of-the-art for extracting pairs of interacting proteins (PPIs) from free text. Various proposals have been put forward, which diverge especially in the specific kernel function, the type of input representation, and the feature sets. These proposals are regularly compared to each other regarding their overall performance on different gold standard corpora, but little is known about their respective performance on the instance level. Results: We report on a detailed analysis of the shared characteristics and the differences between 13 current methods using five PPI corpora. We identified a large number of rather difficult (misclassified by most methods) and easy (correctly classified by most methods) PPIs. We show that kernels using the same input representation perform similarly on these pairs and that building ensembles using dissimilar kernels leads to significant performance gain. However, our analysis also reveals that characteristics shared between difficult pairs are few, which lowers the hope that new methods, if built along the same line as current ones, will deliver breakthroughs in extraction performance. Conclusions: Our experiments show that current methods do not seem to do very well in capturing the shared characteristics of positive PPI pairs, which must also be attributed to the heterogeneity of the (still very few) available corpora. Our analysis suggests that performance improvements shall be sought after rather in novel feature sets than in novel kernel functions.",
keywords = "Error analysis, Kernel methods, Kernel similarity, Protein-protein interaction, Relation extraction",
author = "D. Tikk and Ill{\'e}s Solt and Philippe Thomas and Ulf Leser",
year = "2013",
month = "1",
day = "16",
doi = "10.1186/1471-2105-14-12",
language = "English",
volume = "14",
journal = "BMC Bioinformatics",
issn = "1471-2105",
publisher = "BioMed Central",
number = "1",

}

TY - JOUR

T1 - A detailed error analysis of 13 kernel methods for protein-protein interaction extraction

AU - Tikk, D.

AU - Solt, Illés

AU - Thomas, Philippe

AU - Leser, Ulf

PY - 2013/1/16

Y1 - 2013/1/16

AB - Background: Kernel-based classification is the current state-of-the-art for extracting pairs of interacting proteins (PPIs) from free text. Various proposals have been put forward, which diverge especially in the specific kernel function, the type of input representation, and the feature sets. These proposals are regularly compared to each other regarding their overall performance on different gold standard corpora, but little is known about their respective performance on the instance level. Results: We report on a detailed analysis of the shared characteristics and the differences between 13 current methods using five PPI corpora. We identified a large number of rather difficult (misclassified by most methods) and easy (correctly classified by most methods) PPIs. We show that kernels using the same input representation perform similarly on these pairs and that building ensembles using dissimilar kernels leads to significant performance gain. However, our analysis also reveals that characteristics shared between difficult pairs are few, which lowers the hope that new methods, if built along the same line as current ones, will deliver breakthroughs in extraction performance. Conclusions: Our experiments show that current methods do not seem to do very well in capturing the shared characteristics of positive PPI pairs, which must also be attributed to the heterogeneity of the (still very few) available corpora. Our analysis suggests that performance improvements shall be sought after rather in novel feature sets than in novel kernel functions.

KW - Error analysis

KW - Kernel methods

KW - Kernel similarity

KW - Protein-protein interaction

KW - Relation extraction

UR - http://www.scopus.com/inward/record.url?scp=84872192480&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84872192480&partnerID=8YFLogxK

U2 - 10.1186/1471-2105-14-12

DO - 10.1186/1471-2105-14-12

M3 - Article

C2 - 23323857

AN - SCOPUS:84872192480

VL - 14

JO - BMC Bioinformatics

JF - BMC Bioinformatics

SN - 1471-2105

IS - 1

M1 - 12

ER -