Simple tricks for improving pattern-based information extraction from the biomedical literature

Quang L. Nguyen, D. Tikk, Ulf Leser

Research output: Contribution to journal › Article

7 Citations (Scopus)

Abstract

Background: Pattern-based approaches to relation extraction have shown very good results in many areas of biomedical text mining. However, defining the right set of patterns is difficult; approaches are either manual, incurring high cost, or automatic, often resulting in large sets of noisy patterns.

Results: We propose several techniques for filtering sets of automatically generated patterns and analyze their effectiveness for different extraction tasks, as defined in the recent BioNLP 2009 shared task. We focus on simple methods that only take into account the complexity of the pattern and the complexity of the texts the patterns are applied to. We show that our techniques, despite their simplicity, yield large improvements in all tasks we analyzed. For instance, they raise the F-score for the task of extracting gene expression events from 24.8% to 51.9%.

Conclusions: Even very simple filtering techniques may significantly improve the F-score of an information extraction method based on automatically generated patterns. Furthermore, applying such methods yields a considerable speed-up, as fewer matches need to be analysed. Due to their simplicity, the proposed filtering techniques should also be applicable to other methods using linguistic patterns for information extraction.
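The filtering idea summarized in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how complexity is measured, so the code below assumes, purely for illustration, that a pattern's complexity is its token count and a text's complexity is its whitespace token count. The `Pattern` class, the function names, and the thresholds are all hypothetical. The `f_score` helper uses the standard balanced F-score (harmonic mean of precision and recall), which is assumed to be the measure reported above.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Pattern:
    """Hypothetical pattern: a token sequence with entity placeholders."""
    tokens: Tuple[str, ...]


def pattern_complexity(pattern: Pattern) -> int:
    # Assumption: complexity is approximated by the number of tokens.
    return len(pattern.tokens)


def sentence_complexity(sentence: str) -> int:
    # Assumption: complexity is approximated by the whitespace token count.
    return len(sentence.split())


def filter_patterns(patterns: List[Pattern], max_len: int) -> List[Pattern]:
    # Keep only patterns at or below the (hypothetical) complexity threshold.
    return [p for p in patterns if pattern_complexity(p) <= max_len]


def filter_sentences(sentences: List[str], max_len: int) -> List[str]:
    # Skip overly long sentences before any pattern is matched against them.
    return [s for s in sentences if sentence_complexity(s) <= max_len]


def f_score(precision: float, recall: float) -> float:
    # Balanced F-score: harmonic mean of precision and recall.
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    patterns = [
        Pattern(("GENE", "is", "expressed")),
        Pattern(("GENE", "which", "was", "shown", "to", "be",
                 "expressed", "in", "CELL")),
    ]
    kept = filter_patterns(patterns, max_len=5)
    print(len(kept), "of", len(patterns), "patterns kept")
```

Filtering before matching is also where the speed-up mentioned in the conclusions would come from in such a setup: fewer patterns and fewer candidate sentences mean fewer matches to evaluate.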

Original language: English
Article number: 9
Journal: Journal of Biomedical Semantics
Volume: 1
DOI: 10.1186/2041-1480-1-9
Publication status: Published - Sep 24 2010

Fingerprint

Information Storage and Retrieval
Data Mining
Linguistics
Gene expression
Costs and Cost Analysis
Costs

ASJC Scopus subject areas

  • Information Systems
  • Computer Science Applications
  • Computer Networks and Communications
  • Health Informatics

Cite this

Simple tricks for improving pattern-based information extraction from the biomedical literature. / Nguyen, Quang L.; Tikk, D.; Leser, Ulf.

In: Journal of Biomedical Semantics, Vol. 1, Article 9, 24.09.2010.

@article{58981e0924c8406b8b13ff6566190762,
title = "Simple tricks for improving pattern-based information extraction from the biomedical literature",
abstract = "Background: Pattern-based approaches to relation extraction have shown very good results in many areas of biomedical text mining. However, defining the right set of patterns is difficult; approaches are either manual, incurring high cost, or automatic, often resulting in large sets of noisy patterns. Results: We propose several techniques for filtering sets of automatically generated patterns and analyze their effectiveness for different extraction tasks, as defined in the recent BioNLP 2009 shared task. We focus on simple methods that only take into account the complexity of the pattern and the complexity of the texts the patterns are applied to. We show that our techniques, despite their simplicity, yield large improvements in all tasks we analyzed. For instance, they raise the F-score for the task of extracting gene expression events from 24.8{\%} to 51.9{\%}. Conclusions: Even very simple filtering techniques may significantly improve the F-score of an information extraction method based on automatically generated patterns. Furthermore, applying such methods yields a considerable speed-up, as fewer matches need to be analysed. Due to their simplicity, the proposed filtering techniques should also be applicable to other methods using linguistic patterns for information extraction.",
author = "Nguyen, {Quang L.} and D. Tikk and Ulf Leser",
year = "2010",
month = "9",
day = "24",
doi = "10.1186/2041-1480-1-9",
language = "English",
volume = "1",
journal = "Journal of Biomedical Semantics",
issn = "2041-1480",
publisher = "BioMed Central",

}

TY - JOUR

T1 - Simple tricks for improving pattern-based information extraction from the biomedical literature

AU - Nguyen, Quang L.

AU - Tikk, D.

AU - Leser, Ulf

PY - 2010/9/24

Y1 - 2010/9/24

AB - Background: Pattern-based approaches to relation extraction have shown very good results in many areas of biomedical text mining. However, defining the right set of patterns is difficult; approaches are either manual, incurring high cost, or automatic, often resulting in large sets of noisy patterns. Results: We propose several techniques for filtering sets of automatically generated patterns and analyze their effectiveness for different extraction tasks, as defined in the recent BioNLP 2009 shared task. We focus on simple methods that only take into account the complexity of the pattern and the complexity of the texts the patterns are applied to. We show that our techniques, despite their simplicity, yield large improvements in all tasks we analyzed. For instance, they raise the F-score for the task of extracting gene expression events from 24.8% to 51.9%. Conclusions: Even very simple filtering techniques may significantly improve the F-score of an information extraction method based on automatically generated patterns. Furthermore, applying such methods yields a considerable speed-up, as fewer matches need to be analysed. Due to their simplicity, the proposed filtering techniques should also be applicable to other methods using linguistic patterns for information extraction.

UR - http://www.scopus.com/inward/record.url?scp=84863237864&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84863237864&partnerID=8YFLogxK

U2 - 10.1186/2041-1480-1-9

DO - 10.1186/2041-1480-1-9

M3 - Article

AN - SCOPUS:84863237864

VL - 1

JO - Journal of Biomedical Semantics

JF - Journal of Biomedical Semantics

SN - 2041-1480

M1 - 9

ER -