NLP-Language-Models

Here, I train some n-gram language models on WikiText-2, a corpus of high-quality Wikipedia articles. The dataset was originally introduced in the following paper: https://arxiv.org/pdf/1609.07843v1.pdf. A raw version of the data can easily be viewed here: https://github.com/pytorch/examples/tree/master/word_language_model/data/wikitext-2.
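
As a quick illustration of how that raw data can be read in, here is a minimal loading sketch. The wiki.train.tokens / wiki.test.tokens file names come from the linked repository, while treating each non-empty line as one <s> ... </s>-framed sequence is an assumption of this sketch, not necessarily what my preprocessing does.

```python
def load_corpus(path):
    """Read a WikiText-2 token file into a list of token sequences,
    each framed with <s> ... </s> markers (assumed preprocessing)."""
    corpus = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.split()
            if not tokens:  # skip the blank lines between articles
                continue
            corpus.append(["<s>"] + tokens + ["</s>"])
    return corpus


train_corpus = load_corpus("wikitext-2/wiki.train.tokens")
test_corpus = load_corpus("wikitext-2/wiki.test.tokens")
```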

I implemented 4 types of language models: a unigram model, a smoothed unigram model, a bigram model, and a smoothed bigram model.
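
For concreteness, the estimation step behind these models can be sketched as follows; the unigram case is analogous. The helper assumes Laplace (add-one) smoothing for the smoothed variants, which may differ from the exact smoothing scheme used in my code.

```python
from collections import Counter


def train_counts(corpus):
    """Collect unigram and bigram counts over <s> ... </s>-framed sentences."""
    unigram_counts, bigram_counts = Counter(), Counter()
    for sentence in corpus:
        unigram_counts.update(sentence)                    # token counts, incl. <s> and </s>
        bigram_counts.update(zip(sentence, sentence[1:]))  # (w(i-1), w(i)) pair counts
    return unigram_counts, bigram_counts


def bigram_prob(prev, word, unigram_counts, bigram_counts, smoothed=False):
    """P(word | prev): maximum likelihood, or Laplace (add-one) when smoothed."""
    vocab_size = len(unigram_counts)  # |V| used by add-one smoothing
    if smoothed:
        return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + vocab_size)
    return bigram_counts[(prev, word)] / unigram_counts[prev]
```

The models share the following interface: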

  • generateSentence(self): Return a sentence generated by the language model. It should be a list of the form [<s>, w(1), ..., w(n), </s>], where each w(i) is a word in the vocabulary (including <UNK> but excluding <s> and </s>). I assume that <s> starts each sentence (with probability $1$). The following words w(1), ..., w(n), </s> are generated according to the language model's distribution. The number of words n is not fixed; instead, I stop the sentence as soon as I generate the stop token </s> (all three methods are sketched after this list).

  • getSentenceLogProbability(self, sentence): Return the logarithm of the probability of sentence, which is again a list of the form [<s>, w(1), ..., w(n), </s>].

  • getCorpusPerplexity(self, testCorpus): Compute the perplexity (the inverse probability of the test corpus, normalized by the number of words) of testCorpus according to the model. For a corpus $W$ with $N$ words and a bigram model, Jurafsky and Martin tell us to compute perplexity as follows:

$$Perplexity(W) = \Big [ \prod_{i=1}^N \frac{1}{P(w^{(i)}|w^{(i-1)})} \Big ]^{1/N}$$

To avoid underflow, I did all of my calculations in log-space. That is, instead of multiplying probabilities, I added the logarithms of the probabilities and exponentiated the result:

$$\prod_{i=1}^N P(w^{(i)}|w^{(i-1)}) = \exp\Big (\sum_{i=1}^N \log P(w^{(i)}|w^{(i-1)}) \Big ) $$
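
Putting this together, the three methods could look roughly like the sketch below for a smoothed bigram model. It builds on the train_counts / bigram_prob helpers sketched above; the class name and internals are illustrative assumptions, not a description of my actual implementation.

```python
import math
import random


class BigramModel:
    """Illustrative smoothed bigram model exposing the interface described above."""

    def __init__(self, corpus, smoothed=True):
        self.unigram_counts, self.bigram_counts = train_counts(corpus)
        self.smoothed = smoothed
        # Candidate next words: every vocabulary item except the start token
        self.vocab = [w for w in self.unigram_counts if w != "<s>"]

    def prob(self, prev, word):
        return bigram_prob(prev, word, self.unigram_counts,
                           self.bigram_counts, smoothed=self.smoothed)

    def generateSentence(self):
        """Sample w(i) from P(w | w(i-1)) until </s> is drawn."""
        sentence = ["<s>"]
        while sentence[-1] != "</s>":
            prev = sentence[-1]
            weights = [self.prob(prev, w) for w in self.vocab]  # linear scan; fine for a sketch
            sentence.append(random.choices(self.vocab, weights=weights)[0])
        return sentence

    def getSentenceLogProbability(self, sentence):
        """Sum of log P(w(i) | w(i-1)) over a [<s>, ..., </s>] sentence."""
        return sum(math.log(self.prob(prev, word))
                   for prev, word in zip(sentence, sentence[1:]))

    def getCorpusPerplexity(self, testCorpus):
        """exp(-(1/N) * total log probability), computed entirely in log space."""
        total_log_prob, n_words = 0.0, 0
        for sentence in testCorpus:
            total_log_prob += self.getSentenceLogProbability(sentence)
            n_words += len(sentence) - 1  # every token except the leading <s>
        return math.exp(-total_log_prob / n_words)
```

Under these assumptions, BigramModel(train_corpus).getCorpusPerplexity(test_corpus) evaluates exactly the quantity in the perplexity formula above without ever multiplying raw probabilities.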


See my code for more!
