POS Tagger with Unknown Words Handling

This repository contains a Part-of-Speech (POS) tagger that uses the Viterbi algorithm over a Hidden Markov Model (HMM) to predict POS tags for sentences in the Brown corpus. POS tagging is a common Natural Language Processing (NLP) task. The tagger has the following features:

HMM word emission frequency smoothing;
Unknown word handling;
Extra rules for unknown words based on their morphological idiosyncrasies;
HMM training data saving for quicker program execution.
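
The first two features can be illustrated with a minimal sketch of Viterbi decoding over an HMM whose probabilities use add-one (Laplace) smoothing, with out-of-vocabulary words mapped to a single placeholder token. This is not the code in src: the function names (train, log_prob, viterbi), the `<UNK>` token, and the choice of add-one smoothing are assumptions made for illustration, and the repository may use a different smoothing method.

```python
import math
from collections import Counter, defaultdict

# Hypothetical marker tokens, chosen for this sketch only.
START, END, UNK = "<s>", "</s>", "<UNK>"


def train(tagged_sentences):
    """Count tag transitions and word emissions from lists of (word, tag) pairs."""
    transitions = defaultdict(Counter)  # previous tag -> Counter of next tags
    emissions = defaultdict(Counter)    # tag -> Counter of emitted words
    vocab = set()
    for sentence in tagged_sentences:
        previous = START
        for word, tag in sentence:
            transitions[previous][tag] += 1
            emissions[tag][word] += 1
            vocab.add(word)
            previous = tag
        transitions[previous][END] += 1
    return transitions, emissions, vocab


def log_prob(counter, key, num_outcomes):
    """Add-one smoothed log probability of `key` among `num_outcomes` outcomes."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + num_outcomes))


def viterbi(words, transitions, emissions, vocab):
    """Return the most probable tag sequence for `words`."""
    if not words:
        return []
    tags = list(emissions)
    n_next = len(tags) + 1    # possible next states: every tag plus END
    n_words = len(vocab) + 1  # every known word plus UNK
    words = [w if w in vocab else UNK for w in words]

    # best[i][tag] = (log probability of the best path ending in tag, backpointer)
    best = [{}]
    for tag in tags:
        score = (log_prob(transitions[START], tag, n_next)
                 + log_prob(emissions[tag], words[0], n_words))
        best[0][tag] = (score, None)

    for i in range(1, len(words)):
        best.append({})
        for tag in tags:
            prev, score = max(
                ((p, best[i - 1][p][0] + log_prob(transitions[p], tag, n_next))
                 for p in tags),
                key=lambda pair: pair[1],
            )
            emit = log_prob(emissions[tag], words[i], n_words)
            best[i][tag] = (score + emit, prev)

    # Pick the final tag, including the transition into END, then backtrack.
    last = max(tags, key=lambda t: best[-1][t][0]
               + log_prob(transitions[t], END, n_next))
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path))
```

Working in log space avoids numerical underflow on long sentences. Because every unseen word shares the same smoothed emission probability under each tag, the tag of an unknown word in this sketch is decided mostly by the surrounding transitions.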

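One common way to exploit the morphology of unknown words is to replace each out-of-vocabulary word with a pseudo-word that records its shape (a suffix, capitalisation, digits), so that the HMM can learn separate emission probabilities for each class. The sketch below shows the idea only. The rule set, the class labels, and the normalise function are hypothetical and are not taken from the repository.

```python
import re

# Hypothetical pseudo-word classes. Rules are tried in order, so the more
# specific patterns come first and the first match wins.
UNKNOWN_WORD_RULES = [
    (re.compile(r"^\d+([.,]\d+)*$"), "<UNK-NUM>"),  # 42, 3.14, 1,000
    (re.compile(r".+-.+"), "<UNK-HYPHEN>"),         # well-known
    (re.compile(r"^[A-Z]"), "<UNK-CAP>"),           # London
    (re.compile(r".+ing$"), "<UNK-ING>"),           # running
    (re.compile(r".+ed$"), "<UNK-ED>"),             # jumped
    (re.compile(r".+ly$"), "<UNK-LY>"),             # quickly
    (re.compile(r".+s$"), "<UNK-S>"),               # wombats
]


def normalise(word, vocabulary):
    """Return `word` if it is known, otherwise its morphological pseudo-word."""
    if word in vocabulary:
        return word
    for pattern, label in UNKNOWN_WORD_RULES:
        if pattern.search(word):
            return label
    return "<UNK>"  # fallback when no rule applies
```

For the pseudo-words to receive emission counts, the same mapping would also need to be applied to rare words in the training data (for example, words seen only once) before the HMM is trained.
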
The evolution of the tagger's accuracy using different methods can be seen below. The report can be visited here.

Usage

Before running the program, create a new virtual environment and install the required Python libraries (such as NLTK) with the following command:

pip install -r requirements.txt

To run the POS tagger in Python, move to the src directory and run the following command:

python main.py [-corpus <corpus_name>] [-r] [-d]

where:

-corpus: the name of the corpus to use, either brown or floresta. This argument is optional and defaults to brown.
-r: a flag that forces the program to recompute the HMM's tag transition and word emission probabilities rather than loading previously computed versions into memory.
-d: a flag that enables debugging mode, which prints additional statements on the command line.
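
The options above could be wired up roughly as follows. This is a sketch under stated assumptions, not the contents of main.py: the flag names and the brown default come from this README, while parse_args, load_or_train, the attribute names recompute and debug, and the use of pickle for the saved HMM data are hypothetical.

```python
import argparse
import pickle
from pathlib import Path


def parse_args(argv=None):
    """Parse the command-line options described in this README."""
    parser = argparse.ArgumentParser(description="HMM POS tagger")
    parser.add_argument("-corpus", default="brown",
                        choices=["brown", "floresta"],
                        help="corpus to use (default: brown)")
    parser.add_argument("-r", dest="recompute", action="store_true",
                        help="recompute the HMM probabilities instead of loading them")
    parser.add_argument("-d", dest="debug", action="store_true",
                        help="print additional debugging statements")
    return parser.parse_args(argv)


def load_or_train(cache_path, train_fn, recompute=False):
    """Load a saved model from `cache_path`, or train one and save it.

    `train_fn` is a hypothetical zero-argument callable that returns the
    trained model (for example, the transition and emission tables).
    """
    cache_path = Path(cache_path)
    if cache_path.exists() and not recompute:
        with cache_path.open("rb") as handle:
            return pickle.load(handle)
    model = train_fn()
    with cache_path.open("wb") as handle:
        pickle.dump(model, handle)
    return model
```

With this layout, a call such as load_or_train("hmm_brown.pkl", train_fn, recompute=args.recompute) would train on the first run and load from disk afterwards unless -r is passed. The file name here is only an example.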

License

See the LICENSE file.

Repository: Adamouization/POS-Tagging-and-Unknown-Words
