Conference paper, Year: 2022

CLISTER: A corpus for semantic textual similarity in French clinical narratives

Abstract

Modern Natural Language Processing relies on the availability of annotated corpora for training and evaluating models. Such resources are scarce, especially for specialized domains in languages other than English. In particular, there are very few resources for semantic similarity in the clinical domain in French, although such resources can be useful for many biomedical natural language processing applications, including text generation. We introduce a definition of similarity that is guided by clinical facts and apply it to the development of a new French corpus of 1,000 sentence pairs manually annotated with similarity scores. This new sentence similarity corpus is made freely available to the community. We further evaluate the corpus through experiments in automatic similarity measurement. We show that a sentence embedding model can capture similarity with state-of-the-art performance on the DEFT STS shared task evaluation data set (Spearman correlation of 0.8343). We also show that the CLISTER corpus is complementary to DEFT STS.
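As an illustration of the evaluation metric named in the abstract, the following is a minimal sketch of how predicted similarity scores can be compared to gold annotations with Spearman's rank correlation. It is not the authors' pipeline: the two-dimensional vectors stand in for sentence embeddings, the gold scores are made up, and all function names are hypothetical. Cosine similarity is assumed here as the way to turn a pair of embeddings into a score.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy stand-ins for sentence-pair embeddings (assumption, not corpus data).
pairs = [
    ([1.0, 0.0], [1.0, 0.0]),
    ([1.0, 0.0], [1.0, 1.0]),
    ([1.0, 0.0], [0.0, 1.0]),
    ([1.0, 0.0], [-1.0, 0.0]),
]
gold = [5, 3, 1, 0]  # made-up gold similarity scores, one per pair

predicted = [cosine(u, v) for u, v in pairs]
rho = spearman(predicted, gold)
print(f"Spearman correlation: {rho:.4f}")
```

Because Spearman's correlation depends only on rank order, the predicted scores do not need to be on the same scale as the gold annotations. In this toy example the two rankings agree, so the correlation is close to 1.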
Main file
2022.lrec-1.459.pdf (306.56 KB)
Origin: Files produced by the author(s)

Dates and versions

cea-03740484, version 1 (29-07-2022)

License

Attribution - NonCommercial

Identifiers

  • HAL Id: cea-03740484, version 1

Cite

Nicolas Hiebel, Olivier Ferret, Karën Fort, Aurélie Névéol. CLISTER: A corpus for semantic textual similarity in French clinical narratives. LREC 2022 - 13th Language Resources and Evaluation Conference, Jun 2022, Marseille, France. pp. 4306-4315. ⟨cea-03740484⟩
