Conference paper, Year: 2016

A dataset for open event extraction in English

Abstract

This article presents a corpus for developing and testing event schema induction systems in English. Schema induction is the task of learning templates from unlabeled texts without supervision and grouping together the entities that fill the same role in a template. Most previous work on this subject relies on the MUC-4 corpus. We describe the limits of using this corpus (size, non-representativeness, similarity of roles across templates) and propose a new, partially annotated corpus in English that remedies some of these shortcomings. We use Wikinews to select data from the category Laws & Justice, and query the Google search engine to retrieve different documents on the same events. Only the Wikinews documents are manually annotated and can be used for evaluation, while the others can be used for unsupervised learning. We detail the methodology used to build the corpus and evaluate some existing systems on this new data.
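The split described in the abstract (manually annotated Wikinews documents for evaluation, retrieved web documents for unsupervised learning) can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the `Document` fields, the source labels `"wikinews"` and `"web"`, the role names, and the toy texts are hypothetical and do not reflect the actual file format or annotation scheme of the corpus.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Document:
    """Hypothetical record for one document in a partially annotated corpus."""
    doc_id: str
    event_id: str   # documents reporting the same event share this id
    source: str     # assumed labels: "wikinews" or "web"
    text: str
    # role name -> entity mentions; filled only for manually annotated documents
    annotations: Dict[str, List[str]] = field(default_factory=dict)


def split_corpus(docs: List[Document]) -> Tuple[List[Document], List[Document]]:
    """Return (evaluation_docs, unlabeled_docs) based on the document source."""
    evaluation = [d for d in docs if d.source == "wikinews"]
    unlabeled = [d for d in docs if d.source != "wikinews"]
    return evaluation, unlabeled


# Invented toy data, not taken from the corpus.
docs = [
    Document("wn-1", "ev-1", "wikinews", "A court sentenced a man for fraud.",
             {"defendant": ["a man"], "charge": ["fraud"]}),
    Document("web-1", "ev-1", "web", "Man jailed after fraud trial."),
    Document("web-2", "ev-1", "web", "Fraud case ends with prison term."),
]

evaluation, unlabeled = split_corpus(docs)
print(len(evaluation), "annotated document(s) for evaluation")
print(len(unlabeled), "unlabeled document(s) for unsupervised learning")
```

Keeping a shared `event_id` on both kinds of documents is one way to preserve the link between an annotated Wikinews article and the web documents retrieved for the same event.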
No file deposited

Dates and versions

cea-01843179, version 1 (18-07-2018)

Identifiers

  • HAL Id: cea-01843179, version 1

Cite

K.-H. Nguyen, X. Tannier, O. Ferret, R. Besançon. A dataset for open event extraction in English. 10th International Conference on Language Resources and Evaluation, LREC 2016, May 2016, Portorož, Slovenia. pp. 1939-1943. ⟨cea-01843179⟩