MLSP_LCP_Baseline

A lexical complexity prediction (LCP) baseline for the Multilingual Lexical Simplification Pipeline 2024 Shared Task, modelled as a linear regression on log-frequency. The baseline is trained on the trial set for each language, using the log-frequency of the target (the minimum value over its tokens if the target consists of multiple tokens). We use frequencies provided by the wordfreq package when possible. Since the package uses an incompatible tokenization for Japanese and does not provide any data for Sinhala, we use TUBELEX-JA for Japanese and Word-Frequency-List-for-Sinhala for Sinhala.
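The idea can be illustrated with a minimal sketch. This is not the repository's code: the frequency table, the frequency floor for unseen tokens, the whitespace tokenization, and the trial scores below are all made-up assumptions, and the helper names (`log_frequency`, `fit_line`, `predict`) are hypothetical. The actual baseline takes frequencies from wordfreq, TUBELEX-JA, or the Sinhala list.

```python
import math

# Hypothetical toy relative frequencies; the real baseline reads them from
# wordfreq, TUBELEX-JA, or the Sinhala frequency list instead.
TOY_FREQ = {"the": 5e-2, "big": 3e-4, "cat": 1e-4, "feline": 2e-6, "colossal": 1e-6}
MIN_FREQ = 1e-9  # assumed floor for unseen tokens, so that log10() is defined


def log_frequency(target):
    """Minimum log10 frequency over the target's tokens (assumed whitespace split)."""
    tokens = target.lower().split()
    return min(math.log10(TOY_FREQ.get(tok, MIN_FREQ)) for tok in tokens)


def fit_line(xs, ys):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(target, slope, intercept):
    """Predicted complexity of a target from its log-frequency."""
    return slope * log_frequency(target) + intercept


# Made-up (target, complexity) pairs standing in for one language's trial set.
trial = [("the", 0.0), ("cat", 0.1), ("big", 0.1), ("feline", 0.5), ("colossal", 0.6)]
slope, intercept = fit_line(
    [log_frequency(t) for t, _ in trial],
    [score for _, score in trial],
)
```

With data like this, the fitted slope is negative, so rarer targets receive higher predicted complexity, and a multi-token target is scored by its rarest token.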

Reproducing the baseline

Note that the trained models and output of the baseline are already included in the repository. You can reproduce them by following the steps below.

  1. Initialize and fetch the Git submodules (MLSP_Data, Word-Frequency-List-for-Sinhala, and tubelex):

    git submodule init && git submodule update

  2. Install the requirements:

    python -m pip install -r requirements.txt

  3. Run the baseline (both training and prediction):

    bash experiments.sh

Links