
SlonSpell: Neural Spell-Checker for Slovenian

This repository contains the code for running experiments in the paper "Neural Spell-Checker: Beyond Words with Synthetic Data Generation," which has been accepted to TSD 2024. The core of SlonSpell is a fine-tuned SloBERTa model, designed for improved spell-checking capabilities in Slovenian text. This README will guide you through the process of setting up the environment, generating synthetic training data, training the model, and evaluating its performance.

Table of Contents

  • Installation
  • Data Preparation
  • Synthetic Data Generation
  • Model Training
  • Model Evaluation
  • Acknowledgments
  • License

Installation

Prerequisites

Before you begin, ensure that you have the following software installed:

  • Python 3.8 or higher
  • PyTorch
  • Transformers
  • Additional dependencies as listed in requirements.txt

Setup

  1. Clone the repository:

    git clone https://github.com/matejklemen/slonspell.git
    cd slonspell
  2. Install dependencies:

    Install the required Python packages:

    pip install -r requirements.txt
  3. Download SloBERTa model:

    Download the pre-trained SloBERTa model from Hugging Face and store it in a folder named sloBERTaModel (one way to fetch it is sketched after this list):

    mkdir sloBERTaModel
    # Download the model from Hugging Face and place it in this directory.
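
A minimal sketch of one way to fetch the model programmatically with the transformers library. It assumes the public EMBEDDIA/sloberta checkpoint on Hugging Face; substitute the exact checkpoint used in the paper if it differs:

    from transformers import AutoTokenizer, AutoModelForMaskedLM

    # Assumption: the public EMBEDDIA/sloberta checkpoint; swap in the
    # checkpoint actually used by the paper if it is a different one.
    model_name = "EMBEDDIA/sloberta"

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_name)

    # Save a local copy in the directory the training scripts expect.
    tokenizer.save_pretrained("sloBERTaModel")
    model.save_pretrained("sloBERTaModel")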

Data Preparation

To fine-tune the SloBERTa model, you need to prepare the raw text data:

  1. Prepare raw text data:

    Place your raw text files in a directory named data_folders.

    mkdir data_folders
    # Add your raw text files to this directory.

Synthetic Data Generation

The model training relies on a synthetic dataset generated from the raw text data. Follow these steps:

  1. Generate synthetic data:

    Run the prepare_train_data_BERT_model.py script to generate the synthetic dataset:

    python prepare_train_data_BERT_model.py

    The synthetic dataset will be stored in a directory named train_data. A sketch of one possible corruption scheme is shown below.
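
The exact corruption scheme lives in prepare_train_data_BERT_model.py; the following is only an illustrative sketch of the kind of character-level error injection such scripts typically apply. All function names and parameters here are hypothetical, not the script's actual logic:

    import random

    ALPHABET = "abcčdefghijklmnoprsštuvzž"  # Slovenian alphabet

    def corrupt_word(word: str, p_error: float = 0.1) -> str:
        """Randomly inject one character-level typo (hypothetical scheme)."""
        if len(word) < 2 or random.random() > p_error:
            return word
        i = random.randrange(len(word))
        op = random.choice(["delete", "insert", "replace", "swap"])
        if op == "delete":
            return word[:i] + word[i + 1:]
        if op == "insert":
            return word[:i] + random.choice(ALPHABET) + word[i:]
        if op == "replace":
            return word[:i] + random.choice(ALPHABET) + word[i + 1:]
        # swap two adjacent characters
        j = min(i + 1, len(word) - 1)
        chars = list(word)
        chars[i], chars[j] = chars[j], chars[i]
        return "".join(chars)

    # Tokens that were corrupted get label 1 (error), the rest label 0.
    sentence = "danes je lepo vreme".split()
    corrupted = [corrupt_word(w) for w in sentence]
    labels = [int(c != w) for c, w in zip(corrupted, sentence)]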

Model Training

Once the synthetic dataset is ready, you can fine-tune the SloBERTa model:

  1. Train the model:

    Run the train_sloBERTa_model.py script to start the training process:

    python train_sloBERTa_model.py

    The trained model will be saved in the output directory specified in the script. A hedged fine-tuning sketch is shown below.
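
The actual training logic is in train_sloBERTa_model.py. As a rough orientation, spell-checking can be framed as binary token classification (correct vs. misspelled); the sketch below assumes that framing, and the dataset class, label alignment, and hyperparameters are illustrative stand-ins rather than the script's real values:

    import torch
    from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    tokenizer = AutoTokenizer.from_pretrained("sloBERTaModel")
    model = AutoModelForTokenClassification.from_pretrained(
        "sloBERTaModel", num_labels=2)  # 0 = correct, 1 = misspelled

    class SpellDataset(torch.utils.data.Dataset):
        """Crude stand-in for the real synthetic dataset in train_data/."""
        def __init__(self, texts, word_labels, tokenizer):
            self.enc = tokenizer(texts, truncation=True, padding=True)
            # Naively pad/trim word-level labels to the subword length;
            # the real script aligns labels to subwords more carefully.
            self.labels = [
                (lab + [0] * len(ids))[: len(ids)]
                for lab, ids in zip(word_labels, self.enc["input_ids"])
            ]
        def __len__(self):
            return len(self.labels)
        def __getitem__(self, i):
            item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
            item["labels"] = torch.tensor(self.labels[i])
            return item

    train_dataset = SpellDataset(
        ["danes je lepko vreme"], [[0, 0, 1, 0]], tokenizer)

    args = TrainingArguments(
        output_dir="slonspell-checkpoints",  # hypothetical output directory
        num_train_epochs=3,
        per_device_train_batch_size=32,
        learning_rate=2e-5,
    )

    trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
    trainer.train()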

Model Evaluation

After training the model, you can evaluate its performance using the provided scripts:

  1. Align model predictions:

    First, align the model predictions with the source and target data by running the align_file function in evaluate.py.

  2. Annotate the aligned file:

    Annotate the aligned file by inserting the marker NAPAKA/Č (napaka is Slovenian for "error") immediately before each spelling mistake.

  3. Evaluate the model:

    Use the evaluate_on_annotated_file function to get the final score of the model; a hypothetical invocation is sketched below.
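
Putting the three steps together, an invocation might look as follows. The argument names and file names are assumptions for illustration only; check the actual signatures of align_file and evaluate_on_annotated_file in evaluate.py before running:

    # evaluate here is the repository's evaluate.py, not the pip package.
    from evaluate import align_file, evaluate_on_annotated_file

    # Step 1: align model predictions with the source and target files
    # (argument names are hypothetical; see evaluate.py for the real ones).
    align_file("predictions.txt", "source.txt", "target.txt", "aligned.txt")

    # Step 2 is manual: in the aligned file, prepend NAPAKA/Č to every
    # true spelling error, e.g. "danes je NAPAKA/Č lepko vreme".

    # Step 3: compute the final score on the annotated file.
    score = evaluate_on_annotated_file("aligned_annotated.txt")
    print(score)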

Acknowledgments

This project is based on the research presented in "Neural Spell-Checker: Beyond Words with Synthetic Data Generation," accepted to TSD 2024. We gratefully acknowledge the support of the research community and the contributors to the SloBERTa model.

License

This project is licensed under the MIT License. See the LICENSE file for more details.
