A highly accurate Named Entity corpus for Hungarian

György Szarvas, Richárd Farkas, László Felföldi, András Kocsor, J. Csirik

Research output: Contribution to conference › Paper

13 Citations (Scopus)

Abstract

A highly accurate Named Entity (NE) corpus for Hungarian that is publicly available for research purposes is introduced in the paper, along with its main properties. The results of experiments that apply various Machine Learning models and classifier combination schemes are also presented to serve as a benchmark for further research based on the corpus. The data is a segment of the Szeged Corpus (Csendes et al., 2004), consisting of short business news articles collected from MTI (Hungarian News Agency, www.mti.hu). The annotation procedure was carried out paying special attention to annotation accuracy. The corpus went through a parallel annotation phase done by two annotators, resulting in a tagging with an inter-annotator agreement rate of 99.89%. Controversial taggings were collected and discussed by the two annotators and a linguist with several years of experience in corpus annotation. These examples were tagged following the decision they made together, and finally all entities that had suspicious or dubious annotations were collected and checked for consistency. We consider the result of this correction process to be virtually free of errors. Our best performing Named Entity Recognizer (NER) model attained an F measure of 92.86% on the corpus.
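The two figures quoted in the abstract can be illustrated with a minimal sketch. It assumes the agreement rate is the share of tokens that both annotators tagged identically and that the F measure is the balanced harmonic mean of precision and recall; the abstract does not spell out either definition, and the function names and the tag labels (ORG, PER, LOC, O) are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def agreement_rate(tags_a: Sequence[str], tags_b: Sequence[str]) -> float:
    """Share of tokens that two annotators tagged identically (assumed definition)."""
    if len(tags_a) != len(tags_b):
        raise ValueError("both annotators must tag the same tokens")
    if not tags_a:
        raise ValueError("no tokens to compare")
    matches = sum(a == b for a, b in zip(tags_a, tags_b))
    return matches / len(tags_a)


def f_measure(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Balanced F measure: harmonic mean of precision and recall."""
    denominator = 2 * true_pos + false_pos + false_neg
    if denominator == 0:
        return 0.0  # nothing predicted and nothing to find
    return 2 * true_pos / denominator


# Hypothetical toy data: the annotators disagree on one token out of four.
annotator_1 = ["ORG", "O", "PER", "O"]
annotator_2 = ["ORG", "O", "LOC", "O"]
print(f"agreement: {agreement_rate(annotator_1, annotator_2):.2%}")

# Hypothetical counts: 8 entities found correctly, 2 spurious, 2 missed.
print(f"F measure: {f_measure(8, 2, 2):.2%}")
```

With real data, the two tag sequences would come from the parallel annotation phase, and the entity counts from comparing a recognizer's output against the corrected corpus.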

Original language: English
Pages: 1957-1960
Number of pages: 4
Publication status: Published - Jan 1 2006
Event: 5th International Conference on Language Resources and Evaluation, LREC 2006 - Genoa, Italy
Duration: May 22 2006 to May 28 2006

Other

Other: 5th International Conference on Language Resources and Evaluation, LREC 2006
Country: Italy
City: Genoa
Period: 5/22/06 to 5/28/06

Fingerprint

news agency
news
experiment
learning
experience
Entity
Annotation
Tagging

ASJC Scopus subject areas

  • Education
  • Library and Information Sciences
  • Linguistics and Language
  • Language and Linguistics

Cite this

Szarvas, G., Farkas, R., Felföldi, L., Kocsor, A., & Csirik, J. (2006). A highly accurate Named Entity corpus for Hungarian. 1957-1960. Paper presented at 5th International Conference on Language Resources and Evaluation, LREC 2006, Genoa, Italy.

A highly accurate Named Entity corpus for Hungarian. / Szarvas, György; Farkas, Richárd; Felföldi, László; Kocsor, András; Csirik, J.

2006. 1957-1960. Paper presented at 5th International Conference on Language Resources and Evaluation, LREC 2006, Genoa, Italy.

Szarvas, G, Farkas, R, Felföldi, L, Kocsor, A & Csirik, J 2006, 'A highly accurate Named Entity corpus for Hungarian', Paper presented at 5th International Conference on Language Resources and Evaluation, LREC 2006, Genoa, Italy, 5/22/06 - 5/28/06, pp. 1957-1960.
Szarvas G, Farkas R, Felföldi L, Kocsor A, Csirik J. A highly accurate Named Entity corpus for Hungarian. 2006. Paper presented at 5th International Conference on Language Resources and Evaluation, LREC 2006, Genoa, Italy.
Szarvas, György ; Farkas, Richárd ; Felföldi, László ; Kocsor, András ; Csirik, J. / A highly accurate Named Entity corpus for Hungarian. Paper presented at 5th International Conference on Language Resources and Evaluation, LREC 2006, Genoa, Italy. 4 p.
@conference{8d6931a45c964719837cabec3f1f0024,
title = "A highly accurate Named Entity corpus for Hungarian",
abstract = "A highly accurate Named Entity (NE) corpus for Hungarian that is publicly available for research purposes is introduced in the paper, along with its main properties. The results of experiments that apply various Machine Learning models and classifier combination schemes are also presented to serve as a benchmark for further research based on the corpus. The data is a segment of the Szeged Corpus (Csendes et al., 2004), consisting of short business news articles collected from MTI (Hungarian News Agency, www.mti.hu). The annotation procedure was carried out paying special attention to annotation accuracy. The corpus went through a parallel annotation phase done by two annotators, resulting in a tagging with an inter-annotator agreement rate of 99.89{\%}. Controversial taggings were collected and discussed by the two annotators and a linguist with several years of experience in corpus annotation. These examples were tagged following the decision they made together, and finally all entities that had suspicious or dubious annotations were collected and checked for consistency. We consider the result of this correction process to be virtually free of errors. Our best performing Named Entity Recognizer (NER) model attained an F measure of 92.86{\%} on the corpus.",
author = "Gy{\"o}rgy Szarvas and Rich{\'a}rd Farkas and L{\'a}szl{\'o} Felf{\"o}ldi and Andr{\'a}s Kocsor and J. Csirik",
year = "2006",
month = "1",
day = "1",
language = "English",
pages = "1957--1960",
note = "5th International Conference on Language Resources and Evaluation, LREC 2006 ; Conference date: 22-05-2006 Through 28-05-2006",

}

TY - CONF

T1 - A highly accurate Named Entity corpus for Hungarian

AU - Szarvas, György

AU - Farkas, Richárd

AU - Felföldi, László

AU - Kocsor, András

AU - Csirik, J.

PY - 2006/1/1

Y1 - 2006/1/1

AB - A highly accurate Named Entity (NE) corpus for Hungarian that is publicly available for research purposes is introduced in the paper, along with its main properties. The results of experiments that apply various Machine Learning models and classifier combination schemes are also presented to serve as a benchmark for further research based on the corpus. The data is a segment of the Szeged Corpus (Csendes et al., 2004), consisting of short business news articles collected from MTI (Hungarian News Agency, www.mti.hu). The annotation procedure was carried out paying special attention to annotation accuracy. The corpus went through a parallel annotation phase done by two annotators, resulting in a tagging with an inter-annotator agreement rate of 99.89%. Controversial taggings were collected and discussed by the two annotators and a linguist with several years of experience in corpus annotation. These examples were tagged following the decision they made together, and finally all entities that had suspicious or dubious annotations were collected and checked for consistency. We consider the result of this correction process to be virtually free of errors. Our best performing Named Entity Recognizer (NER) model attained an F measure of 92.86% on the corpus.

UR - http://www.scopus.com/inward/record.url?scp=85037134787&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85037134787&partnerID=8YFLogxK

M3 - Paper

AN - SCOPUS:85037134787

SP - 1957

EP - 1960

ER -