Re-train or train from scratch? Comparing pre-training strategies of BERT in the medical domain

BERT models used in specialized domains all seem to be the result of a simple strategy: initializing with the original BERT and then resuming pre-training on a specialized corpus. This method yields rather good performance (e.g. BioBERT (Lee et al., 2020), SciBERT (Beltagy et al., 2019), BlueBERT (Peng et al., 2019)). However, it seems reasonable to think that training directly on a specialized corpus, using a specialized vocabulary, could result in more tailored embeddings and thus help performance. To test this hypothesis, we train BERT models from scratch using many configurations involving general and medical corpora. Based on evaluations using four different tasks, we find that the initial corpus only has a weak influence on the performance of BERT models when these are further pre-trained on a medical corpus.
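The intuition behind the specialized-vocabulary hypothesis can be illustrated with a small sketch. BERT splits words with a greedy longest-match-first WordPiece procedure, so a vocabulary that lacks medical terms breaks them into several generic pieces, while a specialized vocabulary may keep them whole. The code below is a minimal illustration, not the authors' code: the function name `wordpiece_tokenize` and both toy vocabularies are hypothetical and were invented for this example.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first WordPiece split of a single lowercase word."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabularies, invented for illustration only.
GENERAL_VOCAB = {"hyper", "##tens", "##ion", "##s", "the", "patient"}
MEDICAL_VOCAB = {"hypertension", "hyper", "##tension", "the", "patient"}

general_pieces = wordpiece_tokenize("hypertension", GENERAL_VOCAB)
medical_pieces = wordpiece_tokenize("hypertension", MEDICAL_VOCAB)
print(general_pieces)  # ['hyper', '##tens', '##ion']
print(medical_pieces)  # ['hypertension']
```

With the toy general vocabulary the term is spread over three pieces, whereas the toy medical vocabulary gives it a single embedding. Whether this difference translates into better downstream performance is the question the paper evaluates.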

Keywords

word embeddings; contextualized embeddings; BERT; medical; biomedical; specialized domain; domain adaptation

Domains

Document and Text Processing; Computation and Language [cs.CL]

Main file

2022.lrec-1.281.pdf (791.22 KB)

Origin: Publisher files allowed on an open archive

Contributor: Pierre Zweigenbaum

https://hal.science/hal-03803880

Submitted on: Thursday, October 6, 2022, 23:49:27

Last modified on: Wednesday, April 3, 2024, 11:14:12

Long-term archiving on: Saturday, January 7, 2023, 19:41:05

Dates and versions

hal-03803880, version 1 (06-10-2022)

License

Attribution - NonCommercial

Identifiers

HAL Id: hal-03803880, version 1

Cite

Hicham El Boukkouri, Olivier Ferret, Thomas Lavergne, Pierre Zweigenbaum. Re-train or train from scratch? Comparing pre-training strategies of BERT in the medical domain. LREC 2022 - Language Resources and Evaluation Conference, Jun 2022, Marseille, France. pp.2626-2633. ⟨hal-03803880⟩

Collections

CEA CNRS INRIA LIMSI CENTRALESUPELEC DRT CEA-UPSAY UNIV-PARIS-SACLAY LIST ANR LISN GS-ENGINEERING GS-COMPUTER-SCIENCE GS-SPORT-HUMAN-MOVEMENT LISN-ILES
