A highly accurate Named Entity corpus for Hungarian

György Szarvas, Richárd Farkas, László Felföldi, András Kocsor, J. Csirik

Research output: Paper

14 Citations (Scopus)

Abstract

This paper introduces a highly accurate Named Entity (NE) corpus for Hungarian that is publicly available for research purposes, and describes its main properties. The results of experiments that apply various Machine Learning models and classifier combination schemes are also presented to serve as a benchmark for further research based on the corpus. The data is a segment of the Szeged Corpus (Csendes et al., 2004), consisting of short business news articles collected from MTI (Hungarian News Agency, www.mti.hu). The annotation procedure was carried out with special attention to annotation accuracy. The corpus went through a parallel annotation phase done by two annotators, resulting in a tagging with an inter-annotator agreement rate of 99.89%. Controversial taggings were collected and discussed by the two annotators and a linguist with several years of experience in corpus annotation. These examples were tagged following the decision they made together, and finally all entities that had suspicious or dubious annotations were collected and checked for consistency. We consider the result of this correction process to be virtually free of errors. Our best performing Named Entity Recognizer (NER) model attained an F measure of 92.86% on the corpus.
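The two figures quoted in the abstract, the inter-annotator agreement rate and the F measure, can be illustrated with a minimal Python sketch. The abstract does not say how either number was computed, so everything here is an assumption: agreement is taken as the fraction of tokens on which both annotators chose the same tag, and the F measure as the balanced harmonic mean of precision and recall over exactly matching entities. The tag names and the toy data are hypothetical and are not drawn from the corpus.

```python
from typing import List, Set, Tuple

# Assumption: an entity is identified by (start token, end token, class).
Entity = Tuple[int, int, str]


def agreement_rate(tags_a: List[str], tags_b: List[str]) -> float:
    """Fraction of tokens that both annotators tagged identically."""
    if len(tags_a) != len(tags_b):
        raise ValueError("both annotations must cover the same tokens")
    if not tags_a:
        raise ValueError("annotations must not be empty")
    matches = sum(a == b for a, b in zip(tags_a, tags_b))
    return matches / len(tags_a)


def f_measure(gold: Set[Entity], predicted: Set[Entity]) -> float:
    """Balanced F measure over entities that match the gold standard exactly."""
    correct = len(gold & predicted)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: four tokens, two annotators, one disagreement.
annotator_a = ["ORG", "O", "O", "PER"]
annotator_b = ["ORG", "O", "O", "O"]
toy_agreement = agreement_rate(annotator_a, annotator_b)

# Hypothetical gold and predicted entity sets sharing one exact match.
gold_entities = {(0, 0, "ORG"), (3, 3, "PER")}
predicted_entities = {(0, 0, "ORG"), (2, 2, "LOC")}
toy_f = f_measure(gold_entities, predicted_entities)
```

A chance-corrected statistic such as Cohen's kappa is a common alternative to raw agreement; the abstract reports an agreement rate, so the sketch stays with the simpler ratio.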

Original language: English
Pages: 1957-1960
Number of pages: 4
Publication status: Published - Jan 1 2006
Event: 5th International Conference on Language Resources and Evaluation, LREC 2006 - Genoa, Italy
Duration: May 22 2006 - May 28 2006

ASJC Scopus subject areas

  • Education
  • Library and Information Sciences
  • Linguistics and Language
  • Language and Linguistics

  • Cite this

    Szarvas, G., Farkas, R., Felföldi, L., Kocsor, A., & Csirik, J. (2006). A highly accurate Named Entity corpus for Hungarian. 1957-1960. Paper presented at 5th International Conference on Language Resources and Evaluation, LREC 2006, Genoa, Italy.