Towards Machine Translation of Scientific Neologisms


PaulLerner/neott


neott

Source code and data for the paper Vers la traduction automatique des néologismes scientifiques (Towards Machine Translation of Scientific Neologisms) by Lerner and Yvon (2024, referred to as TALN 2024 hereafter).

Work done within the MaTOS ANR project.

Installation

conda create --name=neott python=3.10 
conda activate neott
conda install pytorch==2.1.1 torchvision==0.16.1 torchaudio==2.1.1 pytorch-cuda=11.8 -c pytorch -c nvidia
git clone https://github.com/PaulLerner/neott.git
pip install -e neott

Experiments

Download Data

prompt LLM

validate prompt hyperparameters (template_form, TALN 2024 fig. 2 left)

python -m neott.prompt --config=exp/prompt/template_form.yaml

You can evaluate a different model by passing, for example, --model_kwargs.pretrained_model_name_or_path=croissantllm/CroissantLLMBase.

test (TALN 2024 fig. 2 right)

python -m neott.prompt --config=exp/prompt/test.yaml

Change the dataset by passing, for example, --eval_path=data/termium/termium.json.

translate with mBART

TODO

visualization

freq

python -m neott.freq data/france_terme/france_terme.json data/roots/ data/france_terme/freq_roots_fr_whole_word.json --whole_word=true --batch_size=10000

python -m neott.freq data/france_terme/france_terme.json /gpfsdswork/dataset/OSCAR/fr_meta/ data/france_terme/freq_oscar_fr_whole_word.json --whole_word=true --batch_size=10000 --hf=false
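To illustrate what the --whole_word=true option presumably controls, here is a minimal sketch of whole-word counting, where a term is counted only when it is not embedded in a longer word. This is not the neott.freq implementation: the function name count_whole_word, the case-insensitive matching, and the toy corpus are assumptions made for the example.

```python
import re


def count_whole_word(terms, corpus_lines):
    """Count how often each term occurs as a whole word (hypothetical helper)."""
    # Lookarounds on \w act as word boundaries and also work for multiword terms.
    patterns = {
        term: re.compile(r"(?<!\w)" + re.escape(term) + r"(?!\w)", re.IGNORECASE)
        for term in terms
    }
    counts = {term: 0 for term in terms}
    for line in corpus_lines:
        for term, pattern in patterns.items():
            counts[term] += len(pattern.findall(line))
    return counts


# Toy corpus: "courriels" must not count as an occurrence of "courriel".
corpus = [
    "Le courriel est arrivé.",
    "Deux courriels, un courriel et un pourriel.",
]
counts = count_whole_word(["courriel", "pourriel"], corpus)
print(counts)
```

On a corpus the size of OSCAR, looping over one regular expression per term would be slow; batching lines (as the --batch_size option suggests) or a single pass over tokenized text would be the more realistic design.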

analyze

You can reproduce all analyses using neott.viz.analyze. Note that metrics are not recomputed: they are read from the output file. To recompute them, use neott.metrics.

python -m neott.viz.analyze data/france_terme/france_terme.json exp/prompt/test/output.json --tokenizer=bigscience/bloom-7b1 --morpher=models/morph/fr/model.bin --freq_paths=data/france_terme/freq_roots_fr_whole_word.json --freq_paths+=data/france_terme/freq_oscar_fr_whole_word.json
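For readers unfamiliar with the EM metric used in these analyses, the sketch below assumes that EM stands for exact match, i.e. the percentage of predicted terms identical to the reference term. The normalization (lowercasing and whitespace collapsing) and the function name exact_match are assumptions; neott.metrics may normalize differently.

```python
def exact_match(predictions, references):
    """Percentage of predictions equal to their reference (hypothetical helper)."""
    if len(predictions) != len(references) or not references:
        raise ValueError("predictions and references must be non-empty and aligned")

    def normalize(text):
        # Assumed normalization: lowercase and collapse runs of whitespace.
        return " ".join(text.lower().split())

    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)


# Toy example: two of the three predictions match their reference.
predictions = ["Courriel", "logiciel  libre", "hameçonnage"]
references = ["courriel", "logiciel libre", "filoutage"]
score = exact_match(predictions, references)
print(f"EM = {score:.1f}")
```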

All optional arguments can be omitted; each one enables a specific analysis:

  • tokenizer is used to compute fertility (TALN 2024 fig. 4)
  • morpher is used to compute morph accuracy (TALN 2024 fig. 3)
  • freq_paths is used to compute EM with respect to term occurrences (TALN 2024 fig. 5)
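As a rough illustration of the fertility analysis, the sketch below uses the common definition of fertility as the average number of subword tokens per whitespace-separated word. Whether neott.viz.analyze normalizes per word or per term is not stated here, so treat the definition as an assumption. The toy_tokenize function is a hypothetical stand-in that keeps the example self-contained.

```python
from typing import Callable, List


def fertility(terms: List[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of subword tokens per whitespace-separated word."""
    n_tokens = sum(len(tokenize(term)) for term in terms)
    n_words = sum(len(term.split()) for term in terms)
    return n_tokens / n_words


def toy_tokenize(text: str, size: int = 4) -> List[str]:
    # Hypothetical tokenizer: cuts each word into chunks of at most `size` characters.
    return [
        word[i:i + size]
        for word in text.split()
        for i in range(0, len(word), size)
    ]


# "courriel" -> 2 chunks, "apprentissage" -> 4, "profond" -> 2: 8 tokens for 3 words.
score = fertility(["courriel", "apprentissage profond"], toy_tokenize)
print(round(score, 2))
```

With the transformers library installed, the tokenize method of AutoTokenizer.from_pretrained("bigscience/bloom-7b1") could be passed in place of toy_tokenize. A higher fertility means the tokenizer splits terms into more pieces.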

If you do not rerun the experiments, you can use our outputs, which are provided in the same repositories as the datasets (e.g. france_terme/taln_2024/bloom-7b1/output.json).

morph

for each language

download data

train a classifier on SIGMORPHON/MorphyNet

generate data from SIGMORPHON/MorphyNet

python -m neott.morph.labels

train classifier

python -m neott.morph.classif train

predict on data

python -m neott.morph.classif --model_path=models/morph/fr/model.bin --lang=fr predict data/france_terme/france_terme.json

python -m neott.morph.classif --model_path=models/morph/en/model.bin --lang=en predict data/france_terme/france_terme.json

Data/preproc

The datasets provided through separate repositories above have been preprocessed with the following pipeline (no need to rerun).

python -m neott.data.termium

python -m neott.data.franceterme

python -m neott.data.filter

python -m neott.data.split

python -m neott.tag

citation

If you use our code or data, please cite:

@inproceedings{lerner:hal-04623021,
  TITLE = {{Vers la traduction automatique des n{\'e}ologismes scientifiques}},
  AUTHOR = {Lerner, Paul and Yvon, Fran{\c c}ois},
  URL = {https://inria.hal.science/hal-04623021},
  BOOKTITLE = {{35{\`e}mes Journ{\'e}es d'{\'E}tudes sur la Parole (JEP 2024) 31{\`e}me Conf{\'e}rence sur le Traitement Automatique des Langues Naturelles (TALN 2024) 26{\`e}me Rencontre des {\'E}tudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RECITAL 2024)}},
  ADDRESS = {Toulouse, France},
  EDITOR = {BALAGUER, Mathieu and BENDAHMAN, Nihed and HO-DAC, Lydia-Mai and MAUCLAIR, Julie and MORENO, Jose G and PINQUIER, Julien},
  PUBLISHER = {{ATALA \& AFPC}},
  VOLUME = {1 : articles longs et prises de position},
  PAGES = {245-261},
  YEAR = {2024},
  MONTH = Jul,
  KEYWORDS = {n{\'e}ologisme ; terminologie ; morphologie ; traduction automatique},
  PDF = {https://inria.hal.science/hal-04623021/file/9096.pdf},
  HAL_ID = {hal-04623021},
  HAL_VERSION = {v1},
}