
A dataset for open event extraction in English

Abstract : This article presents a corpus for developing and testing event schema induction systems in English. Schema induction is the task of learning templates from unlabeled texts without supervision and grouping together the entities that fill the same role in a template. Most previous work on this subject relies on the MUC-4 corpus. We describe the limits of using this corpus (size, non-representativeness, similarity of roles across templates) and propose a new, partially annotated corpus in English that remedies some of these shortcomings. We use Wikinews to select data from the category Laws & Justice, and query the Google search engine to retrieve different documents on the same events. Only the Wikinews documents are manually annotated and can be used for evaluation, while the others can be used for unsupervised learning. We detail the methodology used for building the corpus and evaluate some existing systems on this new data.
Document type :
Conference papers
Contributor : Léna Le Roy
Submitted on : Wednesday, July 18, 2018 - 3:55:51 PM
Last modification on : Saturday, June 25, 2022 - 10:32:43 PM


  • HAL Id : cea-01843179, version 1


K.-H. Nguyen, X. Tannier, Olivier Ferret, R. Besançon. A dataset for open event extraction in English. 10th International Conference on Language Resources and Evaluation, LREC 2016, May 2016, Portoroz, Slovenia. pp.1939-1943. ⟨cea-01843179⟩


