Conference paper, 2016

A dataset for open event extraction in English


This article presents a corpus for developing and testing event schema induction systems in English. Schema induction is the task of learning templates from unlabeled texts without supervision and grouping together the entities that correspond to the same role in a template. Most previous work on this subject relies on the MUC-4 corpus. We describe the limits of using this corpus (size, non-representativeness, similarity of roles across templates) and propose a new, partially annotated corpus in English that remedies some of these shortcomings. We use Wikinews to select the data from the category Laws & Justice, and query the Google search engine to retrieve different documents on the same events. Only the Wikinews documents are manually annotated and can be used for evaluation, while the others can be used for unsupervised learning. We detail the methodology used to build the corpus and evaluate some existing systems on this new data.

Dates and versions

cea-01843179, version 1 (18-07-2018)




K.-H. Nguyen, X. Tannier, Olivier Ferret, R. Besançon. A dataset for open event extraction in English. 10th International Conference on Language Resources and Evaluation, LREC 2016, May 2016, Portoroz, Slovenia. pp.1939-1943. ⟨cea-01843179⟩