
Discussion: Resurrecting the Ngram Model #1342

Closed
iliakur opened this issue Mar 25, 2016 · 21 comments

Comments

@iliakur
Contributor

iliakur commented Mar 25, 2016

Hi folks!

I'm working on getting the Ngram Model module ready to be added back into NLTK, and I'd like to bring up a couple of issues for discussion.

Issue 1
Here @afourney said it would be nice to add interpolation as an alternative to the default Katz backoff as a way of handling unseen ngrams. I've been thinking about this and I might have an idea how this could work. I'd like to run it by all interested parties.

The current class structure of the model module is as follows:

  • model.api.ModelI -> this is supposed to be an Abstract class or an Interface, I guess.
  • model.ngram.NgramModel -> extends above class, contains current implementation of the ngram model.

Here's what I propose:

  • model.api.Model -> I'm honestly not sure I see the point of this; I'm ambivalent about whether to keep it or ditch it.
  • model.ngram.BasicNgramModel -> This is the same as NgramModel, minus everything that has to do with backoff. Basically, it can't handle ngrams unseen in training. "Why have this?" - you might ask. I think this would be a great demo of the need for backoff/interpolation. Students can simply try this out and see how badly it performs to convince themselves to use the other classes.
  • model.ngram.BackoffNgramModel -> Inherits from BasicNgramModel to yield the current implementation of NgramModel, except that it's more explicit about the backoff part.
  • model.ngram.InterpolatedNgramModel -> Also inherits from BasicNgramModel, but uses interpolation instead of backoff.
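A rough sketch of how the basic and interpolated classes might relate (purely illustrative; the class names follow the bullets above, but every method name and signature here is my assumption, not an agreed API):

```python
# Hypothetical sketch of the proposed hierarchy.
class BasicNgramModel:
    """Scores only ngrams seen in training; no backoff or interpolation."""

    def __init__(self, order, ngram_counts):
        self.order = order
        # e.g. {('a', 'b'): 3, ...} mapping ngram tuples to counts
        self.counts = ngram_counts

    def score(self, word, context):
        context = tuple(context)
        total = sum(c for ng, c in self.counts.items() if ng[:-1] == context)
        if total == 0:
            return 0.0  # unseen context: this model simply fails
        return self.counts.get(context + (word,), 0) / total


class InterpolatedNgramModel(BasicNgramModel):
    """Mixes the higher-order estimate with a lower-order model."""

    def __init__(self, order, ngram_counts, backoff_model, lam=0.7):
        super().__init__(order, ngram_counts)
        self.backoff = backoff_model  # a model over shorter contexts
        self.lam = lam

    def score(self, word, context):
        higher = super().score(word, context)
        lower = self.backoff.score(word, tuple(context)[1:])
        return self.lam * higher + (1 - self.lam) * lower
```

A BackoffNgramModel would slot in the same way, overriding score() to fall back to the shorter-context model only when the full ngram is unseen.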

The long-term goals here are:

a) to allow any ProbDist class to be used as a probability estimator since interpolation/backoff are (mostly) agnostic of the smoothing algorithm being used.
b) to allow anyone who wants to optimize an NgramModel for their own purposes to be able to easily inherit some useful defaults from the classes in NLTK.

Issue 2
Unfortunately the probability module has its own problems (e.g. #602, and my Kneser-Ney implementation is wonky). So for now I'm only testing correctness with LidstoneProbDist, since it is easy to compute by hand. Should I be worried about the lack of support for the more advanced smoothing methods? Or do we want to proceed this way to ensure at least that the Ngram Model works, and then tackle the problematic probability classes separately?

Python 3 super()
When calling super(), do I need to worry about supporting python 2? See this for context.

@anttttti

It would be nice to have a working n-gram library in NLTK. SRILM has some Python wrappers for inference, but it has a restrictive license. KenLM has a Python wrapper for doing inference, but it has compilation dependencies. Neither supports estimation. So currently there are no well-tested n-gram tools available for Python NLP.

@iliakur
Contributor Author

iliakur commented Mar 28, 2016

@anttttti Thanks for the feedback, I feel very motivated to submit a patch seeing all this demand for the feature :)

Do you happen to have any thoughts about the specific issues I posted?

@anttttti

The advanced smoothing methods are simple to implement once you understand that they differ only in how the discounting and interpolation are defined. Earlier papers and much of the textbook coverage make the models seem more complicated than they are, because the connections weren't well understood at the time. There shouldn't be a need for separate modules, just configuration of the smoothing. The older backoff models that were not correctly normalized are not used these days; see Joshua Goodman's "A Bit of Progress in Language Modeling" for a great summary. Page 63 of https://arxiv.org/pdf/1602.02332.pdf summarizes some choices of discounting & interpolation for the unigram case; higher-order models use the same recursively. Kneser-Ney is a bit more tricky because of the modified backoffs.

Smoothing is not that critical for most uses. With enough data even optimized Kneser-Ney isn't better than Stupid Backoff. So just having high-order n-grams available in Python with any basic smoothing would be nice. Lidstone or Jelinek-Mercer for each order should work perfectly fine.
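For instance, Jelinek-Mercer interpolation for arbitrary orders can be sketched in a few lines (illustrative only; the count-table layout and `lam` are my assumptions, not an NLTK or SRILM/KenLM API):

```python
# Recursive Jelinek-Mercer: mix the maximum-likelihood estimate at each
# order with the estimate from the next shorter context, bottoming out
# in a uniform 1/V distribution.
def jm_prob(word, context, counts, vocab_size, lam=0.5):
    """P(word | context) = lam * ML(word | context)
    + (1 - lam) * P(word | shorter context)."""
    context = tuple(context)
    order = len(context) + 1
    total = sum(c for ng, c in counts.items()
                if len(ng) == order and ng[:-1] == context)
    ml = counts.get(context + (word,), 0) / total if total else 0.0
    if not context:  # unigram case: interpolate with the uniform model
        return lam * ml + (1 - lam) / vocab_size
    return lam * ml + (1 - lam) * jm_prob(word, context[1:],
                                          counts, vocab_size, lam)
```

Because each level mixes with a proper distribution, the result stays normalized over the vocabulary, which is exactly the property the old unnormalized backoff models lacked.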

@ezubaric
Contributor

Issue 1) One thing that I think would be very useful is to have a utility for building a vocabulary and censoring OOV tokens. That would correct many of the silly errors that frustrated users with the old versions. I am attaching some code that does that (feel free to use or copy)
lm.txt
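The attachment isn't inlined here, but a minimal sketch of such a vocabulary/censoring utility might look like this (names and cutoff behaviour are my assumptions, not taken from lm.txt):

```python
from collections import Counter

def build_vocab(corpus, unk_cutoff=2, unk_label='<UNK>'):
    """Keep word types seen at least `unk_cutoff` times; everything
    else will be mapped to the unknown-word label."""
    counts = Counter(w for sent in corpus for w in sent)
    return {w for w, c in counts.items() if c >= unk_cutoff} | {unk_label}

def censor(sentence, vocab, unk_label='<UNK>'):
    """Replace out-of-vocabulary tokens with the unknown-word label."""
    return [w if w in vocab else unk_label for w in sentence]
```

Fixing the vocabulary once before training, then censoring both training and test data through it, is what prevents the OOV surprises the old model was notorious for.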

Issue 2a) I think that it's still useful to have Kneser-Ney; it's commonly taught and it's useful to have a reference implementation.
Issue 2b) I worry that coupling ProbDist makes this far more complicated than it needs to be. It might be simpler to keep the probability estimation within the language model for things like KN. For other models, it might be fine to use ProbDist.

@afourney

@anttttti "The advanced smoothing methods are simple to implement once you understand that they only differ in how the discounting and interpolation are defined"

@ezubaric "Issue 2b) I worry that coupling ProbDist makes this far more complicated than it needs to be"

Though I haven't looked at this code in a while, my sense is that both of these statements are true.

If I recall correctly, ConditionalProbDist (and more generally ProbDist) are normalized too early for use in smoothed ngram models. E.g., while we know how likely a word is in a given context, we have a hard time reasoning about the contexts themselves. (I believe an earlier patch, #800, attempted to correct this issue; despite best efforts, it was a bit kludgy.)

IMHO, the whole thing is slightly over-engineered.

@iliakur
Contributor Author

iliakur commented Mar 29, 2016

@afourney

IMHO, the whole thing is slightly over-engineered.

Amen to that! I've been trying to make this work forever now (I submitted #800, and yeah, it wasn't elegant at all), and I'm also starting to think there are just too many moving parts for this to be reasonable.

@ezubaric thanks a bunch for that file, I'm totally borrowing its spirit for the refactoring.

Based on all this feedback, here's my new take on the module structure. We have just one class: model.ngram.NgramModelCounter.

This is basically several FreqDist counters combined in a clear interface. Training simply consists of recursively counting the number of ngrams as well as keeping track of the vocab size (with potentially "finalizing" some of these counts to prevent updates after training is done). @alvations I know you'd like a trie implementation for this, but I think we can start with an inefficient recursive counter for now and refactor the backend later since it shouldn't affect the interface much.

Crucially, this class does not deal with probabilities at all. That should make things significantly simpler and at the same time more flexible. All anyone needs to do to add probabilities is use their favorite OOP method (e.g. inheritance or composition) to write a class that uses NgramModelCounter's attributes to construct its own prob() method.
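A rough sketch of what such a counting-only class might look like (speculative; the interface here is my assumption, not what eventually landed in NLTK):

```python
from collections import Counter

class NgramModelCounter:
    """Coordinated ngram counts of all orders up to `order`,
    with no notion of probability."""

    def __init__(self, order):
        self.order = order
        # one Counter per ngram order, 1..order
        self.ngrams = {n: Counter() for n in range(1, order + 1)}
        self.vocab = set()

    def train(self, sentence):
        """Count every ngram of every order in one pass over a sentence."""
        self.vocab.update(sentence)
        for n in range(1, self.order + 1):
            for i in range(len(sentence) - n + 1):
                self.ngrams[n][tuple(sentence[i:i + n])] += 1

    def count(self, ngram):
        return self.ngrams[len(ngram)][tuple(ngram)]
```

A probability class would then hold one of these (or inherit from it) and compute prob() from count(ngram) and count(ngram[:-1]).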

If I have time I'll submit one (or two!) examples of how adding probabilities to NgramModelCounter could work.

What do you folks think?

@alvations
Contributor

@copper-head having an interface as similar to KenLM's as possible would be good for future integration: https://github.com/kpu/kenlm/blob/master/python/example.py

I think after a stable version of NgramModel from NLTK is up, I can try to refactor the kenlm wrapper to use a similar interface to NLTK's, like what we did for scikit-learn.

This function would help with the padding too: https://github.com/nltk/nltk/blob/develop/nltk/util.py#L381

@afourney

I think what @copper-head is suggesting is a class that counts unigrams, bigrams, trigrams, etc. in a coordinated way that is convenient to consume by downstream language models. In that case, I think the kenlm API does not apply yet. (I may be wrong, but from the example posted, it doesn't look like the kenlm API deals in raw frequency counts)

I think it is also worthwhile considering a minimal language model API that consumes those ngram counts. As @copper-head suggests, this would be a subclass, or better yet, a completely separate interface (allowing for vastly different implementations like https://www.projectoxford.ai/weblm). Here, I think it may be reasonable to adopt the kenlm API, but think any ngram LM interface ought to be simple enough that adapters can be easily written.

I think a minimal ngram API really only needs methods to (1) compute the conditional probability of a token given a context or sequence, and (2) report on the size and makeup of the known vocabulary. Most everything else can be computed via helper methods, including computations of joint probability, as well as language generation. These helpers may or may not be part of the interface.
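One possible rendering of that minimal interface (the method names here are illustrative assumptions, chosen only to show that the helpers reduce to the two primitives):

```python
import math
from abc import ABC, abstractmethod

class NgramLanguageModel(ABC):
    """Minimal LM interface: a conditional probability plus a vocabulary."""

    @abstractmethod
    def score(self, word, context):
        """Return P(word | context) under this model."""

    @property
    @abstractmethod
    def vocab(self):
        """Return the set of known word types."""

    # Everything else can be a helper built on the two primitives:
    def logscore_sequence(self, words, order=2):
        """Joint log-probability of a sequence, scoring each word
        against its (order - 1)-word context."""
        total = 0.0
        for i, w in enumerate(words):
            context = tuple(words[max(0, i - order + 1):i])
            total += math.log(self.score(w, context))
        return total
```

Because the helpers never touch model internals, an adapter for a vastly different backend (kenlm, a web service) only has to supply score() and vocab.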

@iliakur
Contributor Author

iliakur commented Mar 29, 2016

Hmm, interesting point. I wonder, though, whether keeping track of those counts for G-T might slow training down, unnecessarily so for folks who don't want that particular smoothing. I think it might make more sense to do the minimum in the basic NgramCounter class and then simply extend its training (or __init__) method in a subclass specialized for Good-Turing, or even in an implementation of the ngram API geared towards computing Good-Turing probabilities.
But I'm only just sitting down to write some of this up, so maybe it won't turn out to be a problem.

@afourney

Sorry, it looks like I accidentally deleted a post. To fill in the missing context for future readers: I think it would be good to consider common smoothing techniques when designing the NgramModelCounter API. For example, allowing users to query the number of species observed once, twice, or N times is important for Good-Turing smoothing (as well as Witten-Bell smoothing, etc.)

Edit: It looks like the FreqDist class already has some of this (see FreqDist.hapaxes and FreqDist.r_Nr). I wonder if it can be repurposed, or if FreqDist should be the starting point.
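For readers without NLTK at hand, what those two FreqDist queries provide can be sketched with a plain Counter (standalone sketch, not NLTK's implementation):

```python
from collections import Counter

def r_Nr(freqs):
    """Map each frequency r to Nr, the number of types seen exactly
    r times -- the table Good-Turing smoothing is built on."""
    return Counter(freqs.values())

def hapaxes(freqs):
    """Types observed exactly once (r = 1)."""
    return [w for w, c in freqs.items() if c == 1]
```

Exposing this table from the counts object would let a Good-Turing (or Witten-Bell) model be layered on top without re-scanning the training data.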

@ezubaric
Contributor

I like the idea of just having a counts object which can then be queried with subclasses that implement specific smoothing methods.

My only concern is that training will have issues if we don't have the ability to fix the vocabulary early: it won't be consistent with standard LM training processes, and tracking the full vocabulary would cause memory to blow up (which was a huge problem with the old LM too).

@iliakur
Contributor Author

iliakur commented Mar 30, 2016

Noted. I have ideas for how to address this. I'll be posting a PR later today.

iliakur added a commit to iliakur/nltk that referenced this issue Mar 30, 2016
@iliakur
Contributor Author

iliakur commented Mar 30, 2016

PR #1351 is up!! Bring on the questions/nitpicks :)

@stevenbird
Member

@copper-head – how far are we away from being able to merge this back into the main branch?

@iliakur
Contributor Author

iliakur commented Oct 26, 2016

Looking at my to-do list, I'd say I need 2-3 days of focused work.
Considering that I'm back to working on this in my free time from school and day job, I'd give it anywhere between 2 weeks and a month before I'm done with all my outstanding issues. This naturally doesn't take into account random stuff that might be brought to my attention in that time.

jacobheil added a commit to jacobheil/sonnetizer that referenced this issue Feb 18, 2017
From here I'm starting to parse the docs about the development of the now-deprecated NgramModel (line 315 of original). Most helpful so far are [this from Stack Exchange (6 Jul 16)](https://stackoverflow.com/questions/37504391/train-ngrammodel-in-python) and [this from issue 1342 (Mar-Oct 16)](nltk/nltk#1342).
@alvations
Contributor

@copper-head @jacobheil and NLTK users/devs who are interested in n-gram language models.

I'd just like to check in on the current state of the model submodule.

  • Do you think it's ready to be pushed out into the develop/master branch?
  • Is it still a topic that people actively pursue and want to see in NLTK?

@ezubaric
Contributor

ezubaric commented Oct 10, 2017 via email

@kindloaf

Hi - I would like to use the "old" feature of language model in NLTK. What is the latest version that still has the pre-trained language model (for English)?

@sleepyfoxen

For those finding this thread, I have kind of bodged together a submodule containing the old model code.

https://github.com/sleepyfoxen/nltk_model

@iliakur
Contributor Author

iliakur commented Aug 25, 2018

@stevenbird I think we can close this finally :)

For concrete feedback on the existing implementation, folks can open separate issues.

@stevenbird
Member

@copper-head yes I agree. Congratulations! :)


8 participants