A dataset for open event extraction in English

K.-H. Nguyen; X. Tannier; Olivier Ferret; R. Besançon

Conference paper, Year: 2016

K.-H. Nguyen

Role: Author

Hanoi University of Science and Technology

X. Tannier

Role: Author
PersonId : 18076
IdHAL : xtannier
ORCID : 0000-0002-2452-8868
IdRef : 113391722

Laboratoire d'Informatique pour la Mécanique et les Sciences de l'Ingénieur

Olivier Ferret

Role: Author
PersonId : 14770
IdHAL : olivier-ferret
ORCID : 0000-0003-0755-2361
IdRef : 155894498

Département Intelligence Ambiante et Systèmes Interactifs

R. Besançon

Role: Author

Département Intelligence Ambiante et Systèmes Interactifs

Abstract

This article presents a corpus for the development and testing of event schema induction systems in English. Schema induction is the task of learning templates from unlabeled texts without supervision and grouping together the entities that fill the same role in a template. Most previous work on this subject relies on the MUC-4 corpus. We describe the limits of using this corpus (size, non-representativeness, similarity of roles across templates) and propose a new, partially annotated corpus in English that remedies some of these shortcomings. We use Wikinews to select data from the category Laws & Justice, and query the Google search engine to retrieve different documents on the same events. Only the Wikinews documents are manually annotated and can be used for evaluation, while the others can be used for unsupervised learning. We detail the methodology used to build the corpus and evaluate some existing systems on this new data.

Keywords

Corpus creation; Development and testing; Event extraction; Data mining; Information analysis; Extraction; Search engines; Unsupervised method; Induction system; Google search engine; Existing systems

Domains

Computer Science [cs]

Contributor: Léna Le Roy

https://cea.hal.science/cea-01843179

Submitted on: Wednesday, 18 July 2018, 15:55:51

Last modified on: Wednesday, 3 April 2024, 11:14:12

Dates and versions

cea-01843179 , version 1 (18-07-2018)

Identifiers

HAL Id: cea-01843179, version 1

Cite

K.-H. Nguyen, X. Tannier, Olivier Ferret, R. Besançon. A dataset for open event extraction in English. 10th International Conference on Language Resources and Evaluation, LREC 2016, May 2016, Portoroz, Slovenia. pp.1939-1943. ⟨cea-01843179⟩

Collections

CEA CNRS LIMSI DRT CEA-UPSAY UNIV-PARIS-SACLAY LIST SORBONNE-UNIVERSITE ANR LISN GS-ENGINEERING GS-COMPUTER-SCIENCE GS-SPORT-HUMAN-MOVEMENT
