
nrc-cnrc/vardial-2023


Multi-label Dialect Identification

Experimental code for multi-label dialect identification, developed for the paper Dialect and Variant Identification as a Multi-Label Classification Task: A Proposal Based on Near-Duplicate Analysis (Bernier-Colborne, Goutte, and Léger; VarDial 2023).

We provide this code for the purpose of reproducing the experiments we conducted on the FreCDo dataset. It is licensed under GPL 3.0, as it uses a library licensed under a previous version of GPL.

Requirements

The scripts below require Python (tested with version 3.9.12), and the following libraries (tested versions are in brackets):

Usage

The following commands assume that the text files containing the data are split into texts and labels, e.g.:

data/
    train.txt
    train.labels
    dev.txt
    dev.labels
    test.txt
    test.labels

This is the format produced by make_dataset.py (see below), but if you want to apply these commands to the original version of the FreCDo dataset, you will have to split the train and dev sets into separate files for texts and labels.
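If your copy of the data is in a single combined file, a helper along these lines can produce the split layout shown above. This is only a sketch: it assumes one label<TAB>text entry per line, which may not match the format of the original FreCDo release, so adjust the parsing accordingly.

```python
# Sketch: split a combined "label<TAB>text" file into parallel .txt and
# .labels files. The tab-separated layout is an assumption; adapt the
# delimiter to the actual format of your source files.

def split_texts_and_labels(combined_path, texts_path, labels_path):
    with open(combined_path, encoding="utf-8") as fin, \
         open(texts_path, "w", encoding="utf-8") as ftxt, \
         open(labels_path, "w", encoding="utf-8") as flab:
        for line in fin:
            label, text = line.rstrip("\n").split("\t", 1)
            ftxt.write(text + "\n")
            flab.write(label + "\n")
```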

All the scripts mentioned below have their own internal documentation, so run python <script-name> -h for more details on usage.

To analyse exact duplicates in the data, use:

python count_dups.py data.txt data.labels
python show_dups.py data.txt data.labels
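
As a rough illustration of what exact-duplicate analysis involves (a simplified stand-in for these scripts, not their actual implementation): group identical texts, then check whether any duplicate group carries conflicting labels.

```python
from collections import defaultdict

def count_exact_dups(texts, labels):
    """Group identical texts; return duplicate groups and label conflicts.

    Simplified illustration only -- count_dups.py and show_dups.py have
    their own options and output formats.
    """
    groups = defaultdict(list)
    for text, label in zip(texts, labels):
        groups[text].append(label)
    # Texts that occur more than once.
    dup_groups = {t: labs for t, labs in groups.items() if len(labs) > 1}
    # Duplicate texts whose copies carry more than one distinct label.
    conflicts = {t: labs for t, labs in dup_groups.items() if len(set(labs)) > 1}
    return dup_groups, conflicts
```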

To analyse near-duplicates in the data using the Levenshtein edit ratio as the similarity measure, with a cutoff of 0.8, use:

python make_sim_matrix.py data.txt sim.pkl -c 0.8 -b 1024 -p loky
python count_near_dups.py sim.pkl data.txt data.labels -m 0.8 -w log.txt -n token

where sim.pkl will contain the result of the first command.
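
For intuition, the near-duplicate criterion can be sketched in pure Python as follows. The actual scripts use the C-backed Levenshtein library and batch and parallelize the pairwise comparisons; the normalization below (edit distance over the longer length) is an assumption and may differ slightly from the library's ratio.

```python
def edit_ratio(a, b):
    """Normalized similarity from Levenshtein (edit) distance.

    Pure-Python stand-in for the similarity used by make_sim_matrix.py;
    the exact normalization there may differ.
    """
    if not a and not b:
        return 1.0
    # Standard dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return 1.0 - prev[-1] / max(len(a), len(b))

def near_duplicate_pairs(texts, cutoff=0.8):
    # Naive O(n^2) scan; make_sim_matrix.py batches and parallelizes this
    # for realistic dataset sizes.
    return [(i, j)
            for i in range(len(texts))
            for j in range(i + 1, len(texts))
            if edit_ratio(texts[i], texts[j]) >= cutoff]
```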

To make a random split from the original split of the FreCDo dataset, optionally combine labels of (near) duplicates, and produce various representations of the resulting data, use:

python make_dataset.py original-data.txt original-data.labels sim.pkl dir_modified_data -m 0.8 -t 0.85 -d 0.05

where original-data.txt and original-data.labels should contain the complete source data, and dir_modified_data will contain the result.
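
The resplitting step can be sketched as follows, assuming -t and -d are the train and dev proportions (with the remainder used for test); make_dataset.py additionally handles combining the labels of (near-)duplicates and writing the various output representations.

```python
import random

def random_split(texts, labels, train_prop=0.85, dev_prop=0.05, seed=42):
    """Shuffle and re-split the data; the remainder becomes the test set.

    Simplified stand-in for the splitting step of make_dataset.py only.
    """
    pairs = list(zip(texts, labels))
    random.Random(seed).shuffle(pairs)
    n = len(pairs)
    n_train = int(n * train_prop)
    n_dev = int(n * dev_prop)
    return (pairs[:n_train],
            pairs[n_train:n_train + n_dev],
            pairs[n_train + n_dev:])
```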

To finetune a CamemBERT model and evaluate it, use one of the following (for single-label and multi-label classification respectively):

python finetune_single.py train.txt train.labels dev.txt dev.labels dir_checkpoint --freeze_embeddings --freeze_encoder_upto 10
python finetune_multi.py train.txt train.labels dev.txt dev.labels dir_checkpoint --freeze_embeddings --freeze_encoder_upto 10

where dir_checkpoint will contain the resulting model, the training logs, etc.
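
A plausible sketch of what the freezing flags do, written duck-typed so it applies to any RoBERTa-style module layout (an embeddings module plus encoder.layer, e.g. the model.roberta submodule of a CamemBERT sequence-classification model); the actual finetuning scripts may implement this differently.

```python
def freeze_for_finetuning(model, freeze_embeddings=True, freeze_encoder_upto=10):
    """Freeze embedding parameters and encoder layers 0..freeze_encoder_upto.

    Sketch of the --freeze_embeddings / --freeze_encoder_upto behaviour;
    assumes the model exposes .embeddings and .encoder.layer, as
    RoBERTa-style transformers such as CamemBERT do.
    """
    if freeze_embeddings:
        for p in model.embeddings.parameters():
            p.requires_grad = False
    for layer in model.encoder.layer[: freeze_encoder_upto + 1]:
        for p in layer.parameters():
            p.requires_grad = False
```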

To evaluate classifiers, use:

python predict.py dir_checkpoint/checkpoint/best_model dir_checkpoint/checkpoint/tokenizer test.txt pred.labels
python evaluate.py pred.labels test.labels multi

where pred.labels will contain the predicted labels output by the first command.
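
For reference, scoring in the multi-label setting treats each line's labels as a set; a set-based micro-averaged F1 can be sketched as follows (an illustration of multi-label scoring, not necessarily the metrics evaluate.py reports).

```python
def micro_f1(pred_sets, gold_sets):
    """Set-based micro-averaged F1 over multi-label predictions.

    Illustration only; evaluate.py may report different or additional
    metrics.
    """
    tp = sum(len(p & g) for p, g in zip(pred_sets, gold_sets))  # correct labels
    fp = sum(len(p - g) for p, g in zip(pred_sets, gold_sets))  # spurious labels
    fn = sum(len(g - p) for p, g in zip(pred_sets, gold_sets))  # missed labels
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```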

Copyright

All files in this repository are Copyright (C) 2023 National Research Council Canada.

Licence

This software is licensed under GPL version 3. It relies on the Levenshtein library, which is licensed under GPL version 2 (or any later version). Licence compatibility of all Python dependencies has been confirmed with licensecheck 2023.1.3.
