Perceptron model from scratch for part-of-speech prediction.

vrivier/pos-tagger

pos-tagger

This project is a part-of-speech (POS) tagger: it assigns a part of speech to each token of a given text.

It uses supervised machine learning. The core algorithm is the perceptron, one of the earliest neural network models. Training requires a corpus in which the parts of speech are already annotated.
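As a rough illustration of the perceptron update rule, here is a minimal multiclass sketch over sparse string features. It is not the code from this repository: the `Perceptron` class, its method names, and the toy feature strings are assumptions made for the example.

```python
from collections import defaultdict


class Perceptron:
    """Minimal multiclass perceptron over sparse string features (illustrative only)."""

    def __init__(self, tags):
        self.tags = list(tags)
        # One weight table per tag: weights[tag][feature] -> float
        self.weights = {tag: defaultdict(float) for tag in self.tags}

    def score(self, features, tag):
        # Sum the weights of the active features for this tag.
        table = self.weights[tag]
        return sum(table.get(feature, 0.0) for feature in features)

    def predict(self, features):
        # Pick the highest-scoring tag; ties go to the first tag in self.tags.
        return max(self.tags, key=lambda tag: self.score(features, tag))

    def update(self, features, gold):
        # On a mistake, reward the gold tag and penalize the predicted tag.
        guess = self.predict(features)
        if guess != gold:
            for feature in features:
                self.weights[gold][feature] += 1.0
                self.weights[guess][feature] -= 1.0
        return guess


# Toy training data (hypothetical feature strings, not the repository's).
model = Perceptron(["DET", "NOUN"])
examples = [(["word=Les"], "DET"), (["word=sport"], "NOUN")]
for _ in range(3):
    for features, gold in examples:
        model.update(features, gold)
```

The weights only change when the prediction is wrong, which is what makes the annotated training corpus necessary.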


To run it, execute test_features.py. The script outputs the percentage of correct predictions on the test dataset.
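The reported percentage can be understood as token-level accuracy. The sketch below shows one way to compute it; the `accuracy` function is a hypothetical helper for illustration and may differ from what test_features.py actually does.

```python
def accuracy(predicted, gold):
    """Return the percentage of positions where the predicted tag equals the gold tag."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * correct / len(gold)


# Three of the four tags match.
score = accuracy(["DET", "NOUN", "ADJ", "VERB"], ["DET", "NOUN", "ADJ", "AUX"])
```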


The data files consist of a list of sentences. Each sentence is a list of words paired with a list of the corresponding tags. Here is the first sentence of the corpus as an example: [['Les', 'commotions', 'cérébrales', 'sont', 'devenu', 'si', 'courantes', 'dans', 'ce', 'sport', "qu'", 'on', 'les', 'considére', 'presque', 'comme', 'la', 'routine', '.'], ['DET', 'NOUN', 'ADJ', 'AUX', 'VERB', 'ADV', 'ADJ', 'ADP', 'DET', 'NOUN', 'SCONJ', 'PRON', 'PRON', 'VERB', 'ADV', 'ADV', 'DET', 'NOUN', 'PUNCT']]
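To make the format concrete, the sketch below unpacks the example sentence into (word, tag) pairs. The `iter_tokens` helper is hypothetical, and how the data files are loaded from disk is not shown because it is not described here.

```python
# The first sentence of the corpus, copied from the example above.
sentence = [
    ['Les', 'commotions', 'cérébrales', 'sont', 'devenu', 'si', 'courantes',
     'dans', 'ce', 'sport', "qu'", 'on', 'les', 'considére', 'presque',
     'comme', 'la', 'routine', '.'],
    ['DET', 'NOUN', 'ADJ', 'AUX', 'VERB', 'ADV', 'ADJ', 'ADP', 'DET', 'NOUN',
     'SCONJ', 'PRON', 'PRON', 'VERB', 'ADV', 'ADV', 'DET', 'NOUN', 'PUNCT'],
]


def iter_tokens(corpus):
    """Yield (word, tag) pairs from a list of [words, tags] sentences."""
    for words, tags in corpus:
        if len(words) != len(tags):
            raise ValueError("each sentence needs exactly one tag per word")
        yield from zip(words, tags)


pairs = list(iter_tokens([sentence]))
```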

The data is in French.

To classify each word, the main features are the word itself, the surrounding words in a window of 2 on each side (5 words in total), and suffixes.
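The sketch below shows what such a feature extractor might look like. It is not the repository's get_features: the `window_features` name, the padding symbols `<s>` and `</s>`, the feature string format, and the maximum suffix length of 3 are all assumptions made for illustration.

```python
def window_features(words, i, window=2, max_suffix=3):
    """Build sparse string features for the word at position i (illustrative only)."""
    word = words[i]
    features = [f"word={word}"]

    # Surrounding words: `window` positions to the left and to the right.
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        if j < 0:
            neighbor = "<s>"      # assumed padding before the sentence start
        elif j >= len(words):
            neighbor = "</s>"     # assumed padding after the sentence end
        else:
            neighbor = words[j]
        features.append(f"word[{offset:+d}]={neighbor}")

    # Suffixes of length 1..max_suffix, skipping any that would cover the whole word.
    for n in range(1, max_suffix + 1):
        if len(word) > n:
            features.append(f"suffix{n}={word[-n:]}")

    return features


feats = window_features(['Les', 'commotions', 'cérébrales', 'sont'], 1)
```

Suffixes such as `ons` can help with words that never appeared in the training corpus, since word endings often correlate with the part of speech.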

The features are defined in the get_features function in test_features.py.
