
Part-of-Speech Tagging for Classical Greek 🏺

This project implements a part-of-speech tagger for Classical (Ancient) Greek, a morphologically rich, low-resource language, using texts from The Ancient Greek and Latin Dependency Treebank and ideas from Speech and Language Processing.

A syllable-based Naive Bayes model with Laplace/Lidstone smoothing predicts fine-grained part-of-speech tags from likelihood and prior estimates derived from the training data. A variant called StupidBayes (named for its relation to the Stupid Backoff smoothing method) is introduced that offers similar accuracy with faster performance.

This morphological approach addresses a major difficulty of part-of-speech tagging for Ancient Greek with classical methods: a complex system of word endings that yields many words occurring only once, combined with highly flexible word order within sentences1.

Implementation

Multinomial Naive Bayes

MultinomialNaiveBayes2 is trained on n-grams of word syllables generated from The Ancient Greek and Latin Dependency Treebank, with the n-gram range controllable via the ngram_range parameter. The first training pass counts syllable n-gram occurrences per class as well as overall class occurrences. Greek diacritics, which are usually stripped, are preserved to give the model more information.

The second pass normalizes the raw counts to produce probabilities (represented via log probabilities for numerical stability) for syllable likelihoods and class priors. Laplace/Lidstone smoothing is implemented using an alpha parameter that can be adjusted at runtime. As likelihood probability distributions are sparse, dictionaries and default values are used to speed up computation.

At prediction time, the model applies Bayes' theorem (assuming conditional independence) to estimate the most likely class via the following relationship:

$$\begin{align*} \text{argmax}_c \log P(\text{class}_c \mid \text{syllables}) &= \text{argmax}_c \left[ \sum_i \log P(\text{syllable}_i \mid \text{class}_c) + \log P(\text{class}_c) \right] \end{align*}$$
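To make the two training passes and the prediction rule concrete, here is a minimal, self-contained sketch of a syllable-level multinomial Naive Bayes classifier with Lidstone smoothing. It is not the repository's implementation: the class and helper names are invented for illustration, and the n-gram generation and smoothing details are assumptions based on the description above.

```python
import math
from collections import Counter, defaultdict


def syllable_ngrams(syllables, ngram_range=(1, 2)):
    """Contiguous syllable n-grams with lengths in the inclusive range."""
    lo, hi = ngram_range
    return [
        tuple(syllables[i : i + n])
        for n in range(lo, hi + 1)
        for i in range(len(syllables) - n + 1)
    ]


class NaiveBayesSketch:
    """Toy syllable-level multinomial Naive Bayes with Lidstone smoothing."""

    def __init__(self, alpha=1.0, ngram_range=(1, 2)):
        self.alpha = alpha
        self.ngram_range = ngram_range

    def fit(self, words, labels):
        # Pass 1: raw n-gram counts per class and class occurrences.
        feature_counts = defaultdict(Counter)
        self.class_counts = Counter(labels)
        for syllables, label in zip(words, labels):
            feature_counts[label].update(syllable_ngrams(syllables, self.ngram_range))
        # Pass 2: normalize to log probabilities for numerical stability.
        vocab = {g for counts in feature_counts.values() for g in counts}
        total = sum(self.class_counts.values())
        self.log_prior = {c: math.log(n / total) for c, n in self.class_counts.items()}
        self.log_likelihood, self.log_unseen = {}, {}
        for c, counts in feature_counts.items():
            denom = sum(counts.values()) + self.alpha * len(vocab)
            self.log_likelihood[c] = {
                g: math.log((n + self.alpha) / denom) for g, n in counts.items()
            }
            # Sparse storage: one shared smoothed default for n-grams unseen in class c.
            self.log_unseen[c] = math.log(self.alpha / denom)
        return self

    def predict(self, syllables):
        # argmax_c [ sum_i log P(syllable_i | class_c) + log P(class_c) ]
        grams = syllable_ngrams(syllables, self.ngram_range)
        scores = {
            c: self.log_prior[c]
            + sum(self.log_likelihood[c].get(g, self.log_unseen[c]) for g in grams)
            for c in self.class_counts
        }
        return max(scores, key=scores.get)
```

Keeping a single smoothed default value per class, rather than materializing the full likelihood table, mirrors the dictionary-with-defaults trick mentioned above for handling sparse distributions.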

Stupid Bayes

StupidBayes3, on the other hand, is a variant of MultinomialNaiveBayes that skips probabilities altogether. The training pass stores only the occurrence counts of syllables per category. The maximum n-gram length is controllable via the ngram_depth parameter. A simplified form of n-gram backoff generates shorter n-grams only to fill gaps in the lookup.

Predictions are generated at test time by simply returning the class with the highest count among all the input n-grams. Formally, this is given by:

$$\begin{align*} \hat{c} &= \text{argmax}_c \sum_i C(\text{class}_c \mid \text{syllable}_i) \end{align*}$$

where $C(\text{class}_c \mid \text{syllable}_i)$ is the stored count of class $c$ for the $i$-th input n-gram.
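A corresponding sketch of the count-only variant is below. The backoff rule here (falling back to shorter n-grams only when no longer n-gram is found in the lookup) is one plausible reading of the simplified backoff described above, not a transcription of the repository's StupidBayes.

```python
from collections import Counter, defaultdict


def ngrams_of_length(syllables, n):
    """Contiguous syllable n-grams of exactly length n."""
    return [tuple(syllables[i : i + n]) for i in range(len(syllables) - n + 1)]


class StupidBayesSketch:
    """Toy count-based classifier: no probabilities, only class counts."""

    def __init__(self, ngram_depth=6):
        self.ngram_depth = ngram_depth
        self.counts = defaultdict(Counter)  # n-gram -> Counter of classes

    def fit(self, words, labels):
        # Store class occurrence counts for every n-gram up to ngram_depth.
        for syllables, label in zip(words, labels):
            for n in range(1, self.ngram_depth + 1):
                for gram in ngrams_of_length(syllables, n):
                    self.counts[gram][label] += 1
        return self

    def predict(self, syllables):
        # Sum stored class counts over the input n-grams, longest first,
        # backing off to shorter n-grams only when nothing longer is found.
        scores = Counter()
        for n in range(self.ngram_depth, 0, -1):
            hits = [g for g in ngrams_of_length(syllables, n) if g in self.counts]
            for gram in hits:
                scores.update(self.counts[gram])
            if hits:
                break
        return scores.most_common(1)[0][0] if scores else None
```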

Results

Training

Model training is carried out by cgpos.eval.train. Training and test sets are produced using a shuffled split strategy, and training is parallelized via concurrent.futures to reduce training time. Results are stored in runs named with the format YYYY-mm-dd_HH-MM-SS.
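As an illustration of the run layout, the following sketch parallelizes training of several candidate configurations with concurrent.futures and writes results into a timestamped run directory; train_one, configs, and the results file name are hypothetical placeholders, not the actual cgpos.eval.train interface.

```python
import time
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path


def run_training(configs, train_one):
    """Train candidate configurations in parallel and log results to a run folder.

    `configs` and `train_one` are hypothetical stand-ins for whatever
    cgpos.eval.train actually iterates over.
    """
    run_dir = Path("runs") / time.strftime("%Y-%m-%d_%H-%M-%S")
    run_dir.mkdir(parents=True, exist_ok=True)
    with ProcessPoolExecutor() as pool:  # parallel training across workers
        results = list(pool.map(train_one, configs))
    (run_dir / "results.txt").write_text("\n".join(map(str, results)))
    return results
```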

cgpos.eval.eval carries out hyperparameter tuning and model selection according to config.yaml. A stratified k-fold shuffle strategy (k=5) and the F1 score are used to handle class imbalance. Multi-output prediction follows the simple strategy of fitting one classifier per target, and a PartOfSpeechTagger is created from the best model configuration for each target.
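The per-target strategy can be pictured with a small wrapper like the one below, which fits one independent classifier per morphological target and collects their predictions into a single tag. This is a sketch reusing the toy models above, not the repository's PartOfSpeechTagger.

```python
class MultiTargetTaggerSketch:
    """Fit one independent classifier per morphological target (pos, case, ...)."""

    def __init__(self, model_per_target):
        # e.g. {"pos": StupidBayesSketch(ngram_depth=9),
        #       "gender": NaiveBayesSketch(alpha=0.1, ngram_range=(1, 3))}
        self.models = model_per_target

    def fit(self, words, targets):
        # `words` is a list of syllable lists; `targets` maps each target name
        # to a label list aligned with `words`.
        for name, model in self.models.items():
            model.fit(words, targets[name])
        return self

    def predict(self, syllables):
        # Tag a single word by querying every per-target classifier.
        return {name: model.predict(syllables) for name, model in self.models.items()}
```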

The best model is stored in models/pos_tagger.pkl, and a text report showing the best model configuration, classification reports, and confusion matrices is saved to reports/report.txt.
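Loading the pickled tagger for inference would look roughly like this; the predict call and input format are assumptions about the tagger's interface, not documented API.

```python
import pickle

# Load the trained tagger saved by the training pipeline.
with open("models/pos_tagger.pkl", "rb") as f:
    tagger = pickle.load(f)

# Hypothetical call: the exact method name and input format (syllable list vs.
# raw word) depend on the PartOfSpeechTagger class and are assumed here.
print(tagger.predict(["λό", "γος"]))
```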

Performance

Best model

The overall fine-grained test accuracy for the best model after hyperparameter tuning is 80.19%.

Test accuracy for morphologically-driven elements (nouns, adjectives, and verbs) is 87.32%.

This is the best model architecture after hyperparameter tuning:

     pos: ('StupidBayes', {'ngram_depth': 9}),
  person: ('MultinomialNaiveBayes', {'alpha': 0.5, 'ngram_range': (1, 2)}),
  number: ('StupidBayes', {'ngram_depth': 6}),
  gender: ('MultinomialNaiveBayes', {'alpha': 0.1, 'ngram_range': (1, 3)}),
    case: ('StupidBayes', {'ngram_depth': 7}),
   tense: ('StupidBayes', {'ngram_depth': 7}),
    mood: ('MultinomialNaiveBayes', {'alpha': 0.3, 'ngram_range': (1, 2)}),
   voice: ('MultinomialNaiveBayes', {'alpha': 0.8, 'ngram_range': (1, 5)}),
  degree: ('StupidBayes', {'ngram_depth': 9})

Classification reports

Part of speech

              precision    recall  f1-score   support

         N/A       0.01      0.33      0.02         3
     article       0.89      0.86      0.87      2983
        noun       0.94      0.89      0.91     11732
   adjective       0.81      0.90      0.85      5316
     pronoun       0.72      0.84      0.78      2921
        verb       0.94      0.89      0.91     10179
      adverb       0.57      0.89      0.70      3089
  adposition       0.96      0.91      0.93      2885
 conjunction       0.81      0.72      0.76      3133
     numeral       0.83      0.63      0.72        71
interjection       1.00      1.00      1.00        19
    particle       0.89      0.72      0.80      4810
 punctuation       1.00      0.99      1.00      5574
    accuracy                           0.87     52715
   macro avg       0.80      0.81      0.79     52715
weighted avg       0.89      0.87      0.87     52715

Person

               precision    recall  f1-score   support

          N/A       0.99      0.99      0.99     46656
 first person       0.76      0.77      0.77       839
second person       0.65      0.64      0.64       732
 third person       0.93      0.87      0.90      4488

     accuracy                           0.97     52715
    macro avg       0.83      0.82      0.83     52715
 weighted avg       0.97      0.97      0.97     52715

Number

              precision    recall  f1-score   support

         N/A       0.96      0.98      0.97     20844
    singular       0.96      0.92      0.94     22222
      plural       0.87      0.92      0.90      9516
        dual       0.64      0.92      0.76       133
    
    accuracy                           0.94     52715
   macro avg       0.86      0.94      0.89     52715
weighted avg       0.94      0.94      0.94     52715

Gender

              precision    recall  f1-score   support

         N/A       0.98      0.97      0.97     27309
   masculine       0.90      0.91      0.90     13571
    feminine       0.88      0.90      0.89      6704
      neuter       0.79      0.79      0.79      5131

    accuracy                           0.93     52715
   macro avg       0.89      0.89      0.89     52715
weighted avg       0.93      0.93      0.93     52715

Case

              precision    recall  f1-score   support

         N/A       0.98      0.95      0.96     27589
  nominative       0.80      0.87      0.83      7039
    genitive       0.92      0.92      0.92      4870
      dative       0.88      0.93      0.91      3628
  accusative       0.88      0.85      0.86      9309
    vocative       0.55      0.87      0.68       280

    accuracy                           0.92     52715
   macro avg       0.83      0.90      0.86     52715
weighted avg       0.92      0.92      0.92     52715

Tense-aspect

                 precision    recall  f1-score   support

            N/A       0.99      0.97      0.98     44010
        present       0.84      0.93      0.88      3471
      imperfect       0.78      0.92      0.84      1083
        perfect       0.81      0.90      0.85       509
plusquamperfect       0.67      0.59      0.63       105
 future perfect       0.00      0.00      0.00         1
         future       0.62      0.84      0.71       299
         aorist       0.85      0.91      0.88      3237

       accuracy                           0.96     52715
      macro avg       0.70      0.76      0.72     52715
   weighted avg       0.97      0.96      0.96     52715

Mood

              precision    recall  f1-score   support

         N/A       0.98      0.99      0.99     42640
  indicative       0.92      0.92      0.92      4707
 subjunctive       0.78      0.66      0.71       479
  infinitive       0.99      0.87      0.92      1372
  imperative       0.54      0.61      0.57       276
  participle       0.95      0.89      0.92      2922
    optative       0.89      0.72      0.80       319

    accuracy                           0.97     52715
   macro avg       0.87      0.81      0.83     52715
weighted avg       0.97      0.97      0.97     52715

Voice

               precision    recall  f1-score   support

          N/A       0.99      0.99      0.99     42987
       active       0.94      0.94      0.94      6765
      passive       0.77      0.91      0.84       261
       middle       0.84      0.81      0.83       902
medio-passive       0.93      0.84      0.89      1800

     accuracy                           0.98     52715
    macro avg       0.89      0.90      0.90     52715
 weighted avg       0.98      0.98      0.98     52715

Degree

              precision    recall  f1-score   support

         N/A       1.00      1.00      1.00     52532
    positive       0.00      0.00      0.00         1
 comparative       0.76      0.64      0.69       137
 superlative       0.37      0.67      0.48        45

    accuracy                           1.00     52715
   macro avg       0.53      0.58      0.54     52715
weighted avg       1.00      1.00      1.00     52715

Confusion matrix

| true / pred | article | noun | adjective | pronoun | verb | adverb | adposition | conjunction | numeral | interjection | particle | punctuation |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| article | 2559 | 3 | 4 | 405 | 5 | 7 | 0 | 0 | 0 | 0 | 0 | 0 |
| noun | 63 | 10393 | 491 | 179 | 404 | 73 | 9 | 33 | 1 | 0 | 63 | 0 |
| adjective | 4 | 253 | 4763 | 160 | 74 | 21 | 9 | 15 | 7 | 0 | 2 | 0 |
| pronoun | 69 | 20 | 302 | 2456 | 15 | 23 | 0 | 1 | 0 | 0 | 34 | 0 |
| verb | 1 | 326 | 242 | 100 | 9031 | 64 | 39 | 151 | 1 | 0 | 211 | 0 |
| adverb | 4 | 38 | 36 | 24 | 45 | 2744 | 54 | 65 | 0 | 0 | 74 | 0 |
| adposition | 160 | 3 | 5 | 17 | 12 | 50 | 2635 | 2 | 0 | 0 | 0 | 0 |
| conjunction | 4 | 2 | 4 | 57 | 3 | 763 | 5 | 2252 | 0 | 0 | 42 | 0 |
| numeral | 0 | 1 | 25 | 0 | 0 | 0 | 0 | 0 | 45 | 0 | 0 | 0 |
| interjection | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 19 | 0 | 0 |
| particle | 7 | 7 | 11 | 4 | 24 | 1033 | 6 | 252 | 0 | 0 | 3464 | 0 |
| punctuation | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5545 |

Error analysis

As expected, the part-of-speech tagger does well on highly morphologically-driven elements such as nouns and verbs, but it struggles to discern positionally-driven elements such as conjunctions and particles. A model architecture that incorporates positional context would be needed to close this gap.

Build instructions

1. Set environment

Get project requirements by setting up a poetry environment:

  1. Install poetry
  2. In the terminal, run:
$ git clone https://github.com/tejomaygadgil/cgpos.git
$ cd cgpos/
$ make activate_poetry
poetry install
Installing dependencies from lock file

No dependencies to install or update

Installing the current project: cgpos (1.0.0)
...

2. Get data

Grab project data:

$ cd /dir/to/repository
$ make make_dataset
Initializing data directory
mkdir data/raw/zip
Grabbing Perseus data
...

3. Build features

Ready data for training by tokenizing word syllables:

$ cd /dir/to/repository
$ make build_features
Building features
python src/cgpos/data/features.py
...

4. Train model

Train the part-of-speech tagger and evaluate performance using:

$ cd /dir/to/repository
$ make train_model
Training model
python src/cgpos/eval/train.py
...

Hyperparameter ranges and model selection are adjustable via the config file (config.yaml).

References

  - The Ancient Greek and Latin Dependency Treebank
  - Speech and Language Processing (Jurafsky & Martin)

Help

You can get a list of all available make targets by running:

$ make
Available rules:

activate_poetry     Activate poetry environment 
build_features      Build features 
get_data            Get raw data 
init_data_dir       Initialize data directory 
install_poetry      Install poetry environment 
make_dataset        Make Perseus dataset 
remove_all_data     Remove all data 
remove_data         Remove processed data 
tests               Run tests
train_model         Train model and evaluate performance

Tools

This repository uses Poetry for dependency management and GNU Make for task automation.

Footnotes

  1. This is in contrast to analytic languages like English, which rely on word order and function words such as "had" and "will" to express grammatical distinctions.

  2. from cgpos.models.multinomial_naive_bayes import MultinomialNaiveBayes

  3. from cgpos.models.multinomial_naive_bayes import StupidBayes
