
Discussion: Resurrecting the Ngram Model #1342

Closed
iliakur opened this issue Mar 25, 2016 · 21 comments

Comments

@iliakur
Contributor

iliakur commented Mar 25, 2016

Hi folks!

I'm working on getting the Ngram Model module ready to be added back into NLTK, and I'd like to bring up a couple of issues for discussion.

Issue 1
Here @afourney said it would be nice to add interpolation as an alternative to the default Katz backoff as a way of handling unseen ngrams. I've been thinking about this and I might have an idea how this could work. I'd like to run it by all interested parties.

The current class structure of the model module is as follows:

  • model.api.ModelI -> this is supposed to be an Abstract class or an Interface, I guess.
  • model.ngram.NgramModel -> extends above class, contains current implementation of the ngram model.

Here's what I propose:

  • model.api.Model -> I'm honestly not sure I see the point of this; I'm ambivalent about whether to keep it or ditch it.
  • model.ngram.BasicNgramModel -> This is the same as NgramModel, minus everything that has to do with backoff. Basically, it can't handle ngrams unseen in training. "Why have this?" - you might ask. I think this would be a great demo of the need for backoff/interpolation. Students can simply try this out and see how badly it performs to convince themselves to use the other classes.
  • model.ngram.BackoffNgramModel -> Inherits from BasicNgramModel to yield the current implementation of NgramModel, except that it's more explicit about the backoff part.
  • model.ngram.InterpolatedNgramModel -> Also inherits from BasicNgramModel, but uses interpolation instead of backoff.
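A rough sketch of how the basic and interpolated classes might relate (purely illustrative; the class names follow the bullets above, but every method name and signature here is my assumption, not an agreed API):

```python
# Hypothetical sketch of the proposed hierarchy.
class BasicNgramModel:
    """Scores only ngrams seen in training; no backoff or interpolation."""

    def __init__(self, order, ngram_counts):
        self.order = order
        # e.g. {('a', 'b'): 3, ...} mapping ngram tuples to counts
        self.counts = ngram_counts

    def score(self, word, context):
        context = tuple(context)
        total = sum(c for ng, c in self.counts.items() if ng[:-1] == context)
        if total == 0:
            return 0.0  # unseen context: this model simply fails
        return self.counts.get(context + (word,), 0) / total


class InterpolatedNgramModel(BasicNgramModel):
    """Mixes the higher-order estimate with a lower-order model."""

    def __init__(self, order, ngram_counts, backoff_model, lam=0.7):
        super().__init__(order, ngram_counts)
        self.backoff = backoff_model  # a model over shorter contexts
        self.lam = lam

    def score(self, word, context):
        higher = super().score(word, context)
        lower = self.backoff.score(word, tuple(context)[1:])
        return self.lam * higher + (1 - self.lam) * lower
```

A BackoffNgramModel would slot in the same way, overriding score() to fall back to the shorter-context model only when the full ngram is unseen.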

The long-term goals here are:

a) to allow any ProbDist class to be used as a probability estimator since interpolation/backoff are (mostly) agnostic of the smoothing algorithm being used.
b) to allow anyone who wants to optimize an NgramModel for their own purposes to be able to easily inherit some useful defaults from the classes in NLTK.

Issue 2
Unfortunately the probability module has its own problems (e.g. #602, and my Kneser-Ney implementation is wonky). So for now I'm only testing correctness with LidstoneProbDist, since it is easy to compute by hand. Should I be worried about the lack of support for the more advanced smoothing methods? Or do we want to proceed this way to ensure at least that the Ngram Model works, and then tackle the problematic probability classes separately?

Python 3 super()
When calling super(), do I need to worry about supporting python 2? See this for context.

@anttttti

It would be nice to have a working n-gram library in NLTK. SRILM has some Python wrappers for inference, but it has a restrictive license. KenLM has a Python wrapper for doing inference, but it has compilation dependencies. Neither supports estimation. So currently there are no well-tested n-gram tools available for Python NLP.

@iliakur
Contributor Author

iliakur commented Mar 28, 2016

@anttttti Thanks for the feedback, I feel very motivated to submit a patch seeing all this demand for the feature :)

Do you happen to have any thoughts about the specific issues I posted?

@anttttti

The advanced smoothing methods are simple to implement once you understand that they differ only in how the discounting and interpolation are defined. Earlier papers and much of the textbook coverage make the models seem more complicated than they are, because the connections weren't well understood at the time. There shouldn't be a need for separate modules, just configuration of the smoothing. The older backoff models that were not correctly normalized are not used these days; see Joshua Goodman's "A Bit of Progress in Language Modeling" for a great summary. Page 63 of https://arxiv.org/pdf/1602.02332.pdf summarizes some choices of discounting & interpolation for the unigram case; higher-order models use the same recursively. Kneser-Ney is a bit more tricky because of the modified backoffs.

Smoothing is not that critical for most uses. With enough data even optimized Kneser-Ney isn't better than Stupid Backoff. So just having high-order n-grams available in Python with any basic smoothing would be nice. Lidstone or Jelinek-Mercer for each order should work perfectly fine.
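For instance, Jelinek-Mercer interpolation for arbitrary orders can be sketched in a few lines (illustrative only; the count-table layout and `lam` are my assumptions, not an NLTK or SRILM/KenLM API):

```python
# Recursive Jelinek-Mercer: mix the maximum-likelihood estimate at each
# order with the estimate from the next shorter context, bottoming out
# in a uniform 1/V distribution.
def jm_prob(word, context, counts, vocab_size, lam=0.5):
    """P(word | context) = lam * ML(word | context)
    + (1 - lam) * P(word | shorter context)."""
    context = tuple(context)
    order = len(context) + 1
    total = sum(c for ng, c in counts.items()
                if len(ng) == order and ng[:-1] == context)
    ml = counts.get(context + (word,), 0) / total if total else 0.0
    if not context:  # unigram case: interpolate with the uniform model
        return lam * ml + (1 - lam) / vocab_size
    return lam * ml + (1 - lam) * jm_prob(word, context[1:],
                                          counts, vocab_size, lam)
```

Because each level mixes with a proper distribution, the result stays normalized over the vocabulary, which is exactly the property the old unnormalized backoff models lacked.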

@ezubaric
Contributor

Issue 1) One thing that I think would be very useful is to have a utility for building a vocabulary and censoring OOV tokens. That would correct many of the silly errors that frustrated users with the old versions. I am attaching some code that does that (feel free to use or copy)
lm.txt
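The attachment isn't inlined here, but a minimal sketch of such a vocabulary/censoring utility might look like this (names and cutoff behaviour are my assumptions, not taken from lm.txt):

```python
from collections import Counter

def build_vocab(corpus, unk_cutoff=2, unk_label='<UNK>'):
    """Keep word types seen at least `unk_cutoff` times; everything
    else will be mapped to the unknown-word label."""
    counts = Counter(w for sent in corpus for w in sent)
    return {w for w, c in counts.items() if c >= unk_cutoff} | {unk_label}

def censor(sentence, vocab, unk_label='<UNK>'):
    """Replace out-of-vocabulary tokens with the unknown-word label."""
    return [w if w in vocab else unk_label for w in sentence]
```

Fixing the vocabulary once before training, then censoring both training and test data through it, is what prevents the OOV surprises the old model was notorious for.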

Issue 2a) I think that it's still useful to have Kneser-Ney; it's commonly taught and it's useful to have a reference implementation.
Issue 2b) I worry that coupling ProbDist makes this far more complicated than it needs to be. It might be simpler to keep the probability estimation within the language model for things like KN. For other models, it might be fine to use ProbDist.

@afourney

@anttttti "The advanced smoothing methods are simple to implement once you understand that they only differ in how the discounting and interpolation are defined"

@ezubaric "Issue 2b) I worry that coupling ProbDist makes this far more complicated than it needs to be"

Though I haven't looked at this code in a while, my sense is that both of these statements are true.

If I recall correctly, ConditionalProbDist (and more generally ProbDist) are normalized too early for use in smoothed ngram models. E.g., while we know how likely a word is in a given context, we have a hard time reasoning about the contexts themselves. (I believe an earlier patch, #800, attempted to correct this issue; despite best efforts, it was a bit kludgy.)

IMHO, the whole thing is slightly over-engineered.

@iliakur
Contributor Author

iliakur commented Mar 29, 2016

@afourney

IMHO, the whole thing is slightly over-engineered.

Amen to that! I've been trying to make this work forever now (I submitted #800, and yeah, it wasn't elegant at all), and I'm also starting to think there are just too many moving parts for this to be reasonable.

@ezubaric thanks a bunch for that file, I'm totally borrowing its spirit for the refactoring.

Based on all this feedback, here's my new take on the module structure. We have just one class: model.ngram.NgramModelCounter.

This is basically several FreqDist counters combined in a clear interface. Training simply consists of recursively counting the number of ngrams as well as keeping track of the vocab size (with potentially "finalizing" some of these counts to prevent updates after training is done). @alvations I know you'd like a trie implementation for this, but I think we can start with an inefficient recursive counter for now and refactor the backend later since it shouldn't affect the interface much.

Crucially, this class does not deal with probabilities at all. That should make things significantly simpler and at the same time more flexible. All anyone needs to do to add probabilities is use their favorite OOP method (e.g. inheritance or composition) to write a class that uses NgramModelCounter's attributes to construct its own prob() method.
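A rough sketch of what such a counting-only class might look like (speculative; the interface here is my assumption, not what eventually landed in NLTK):

```python
from collections import Counter

class NgramModelCounter:
    """Coordinated ngram counts of all orders up to `order`,
    with no notion of probability."""

    def __init__(self, order):
        self.order = order
        # one Counter per ngram order, 1..order
        self.ngrams = {n: Counter() for n in range(1, order + 1)}
        self.vocab = set()

    def train(self, sentence):
        """Count every ngram of every order in one pass over a sentence."""
        self.vocab.update(sentence)
        for n in range(1, self.order + 1):
            for i in range(len(sentence) - n + 1):
                self.ngrams[n][tuple(sentence[i:i + n])] += 1

    def count(self, ngram):
        return self.ngrams[len(ngram)][tuple(ngram)]
```

A probability class would then hold one of these (or inherit from it) and compute prob() from count(ngram) and count(ngram[:-1]).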

If I have time I'll submit one (or two!) examples of how adding probabilities to NgramModelCounter could work.

What do you folks think?

@alvations
Contributor

@copper-head having an interface as similar to KenLM's as possible would be good for future integration: https://github.com/kpu/kenlm/blob/master/python/example.py

I think after a stable version of NgramModel from NLTK is up, I can try to refactor the kenlm wrapper to use a similar interface to NLTK's, like what we did for scikit-learn.

This function would help with the padding too: https://github.com/nltk/nltk/blob/develop/nltk/util.py#L381

@afourney

I think what @copper-head is suggesting is a class that counts unigrams, bigrams, trigrams, etc. in a coordinated way that is convenient to consume by downstream language models. In that case, I think the kenlm API does not apply yet. (I may be wrong, but from the example posted, it doesn't look like the kenlm API deals in raw frequency counts)

I think it is also worthwhile considering a minimal language model API that consumes those ngram counts. As @copper-head suggests, this would be a subclass, or better yet, a completely separate interface (allowing for vastly different implementations like https://www.projectoxford.ai/weblm). Here, I think it may be reasonable to adopt the kenlm API, but think any ngram LM interface ought to be simple enough that adapters can be easily written.

I think a minimal ngram API really only needs methods to (1) compute the conditional probability of a token given a context or sequence, and (2) report on the size and makeup of the known vocabulary. Most everything else can be computed via helper methods, including computations of joint probability, as well as language generation. These helpers may or may not be part of the interface.
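One possible rendering of that minimal interface (the method names here are illustrative assumptions, chosen only to show that the helpers reduce to the two primitives):

```python
import math
from abc import ABC, abstractmethod

class NgramLanguageModel(ABC):
    """Minimal LM interface: a conditional probability plus a vocabulary."""

    @abstractmethod
    def score(self, word, context):
        """Return P(word | context) under this model."""

    @property
    @abstractmethod
    def vocab(self):
        """Return the set of known word types."""

    # Everything else can be a helper built on the two primitives:
    def logscore_sequence(self, words, order=2):
        """Joint log-probability of a sequence, scoring each word
        against its (order - 1)-word context."""
        total = 0.0
        for i, w in enumerate(words):
            context = tuple(words[max(0, i - order + 1):i])
            total += math.log(self.score(w, context))
        return total
```

Because the helpers never touch model internals, an adapter for a vastly different backend (kenlm, a web service) only has to supply score() and vocab.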

@iliakur
Contributor Author

iliakur commented Mar 29, 2016

Hmm, interesting point. I wonder, though, whether keeping track of those counts for G-T might slow training down, unnecessarily so for folks who don't want that particular smoothing. I think it might make more sense to do the minimum in the basic NgramCounter class and then simply extend its training (or __init__) method in a subclass specialized for Good-Turing, or even in an implementation of the ngram API geared towards computing Good-Turing probabilities.
But I'm only just sitting down to write some of this up, so maybe it won't turn out to be a problem.

@afourney

Sorry, it looks like I accidentally deleted a post. To fill in the missing context for future readers: I think it would be good to consider common smoothing techniques when designing the NgramModelCounter API. For example, allowing users to query the number of species observed once, twice, or N times is important for Good-Turing smoothing (as well as Witten-Bell smoothing, etc.)

Edit: It looks like the FreqDist class already has some of this (see FreqDist.hapaxes and FreqDist.r_Nr). I wonder if it can be repurposed, or if FreqDist should be the starting point.
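For readers without NLTK at hand, what those two FreqDist queries provide can be sketched with a plain Counter (standalone sketch, not NLTK's implementation):

```python
from collections import Counter

def r_Nr(freqs):
    """Map each frequency r to Nr, the number of types seen exactly
    r times -- the table Good-Turing smoothing is built on."""
    return Counter(freqs.values())

def hapaxes(freqs):
    """Types observed exactly once (r = 1)."""
    return [w for w, c in freqs.items() if c == 1]
```

Exposing this table from the counts object would let a Good-Turing (or Witten-Bell) model be layered on top without re-scanning the training data.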

@ezubaric
Contributor

I like the idea of just having a counts object which can then be queried with subclasses that implement specific smoothing methods.

My only concern is that training will have issues if we don't have the ability to fix the vocabulary early: it won't be consistent with standard LM training processes, and tracking the full vocabulary would cause memory to blow up (which was a huge problem with the old LM too).

@iliakur
Contributor Author

iliakur commented Mar 30, 2016

Noted. I have ideas for how to address this. I'll be posting a PR later today.

iliakur added a commit to iliakur/nltk that referenced this issue Mar 30, 2016
@iliakur
Contributor Author

iliakur commented Mar 30, 2016

PR #1351 is up!! Bring on the questions/nitpicks :)

@stevenbird
Member

@copper-head – how far are we away from being able to merge this back into the main branch?

@iliakur
Contributor Author

iliakur commented Oct 26, 2016

Looking at my to-do list, I'd say I need 2-3 days of focused work.
Considering that I'm back to working on this in my free time from school and day job, I'd give it anywhere between 2 weeks and a month before I'm done with all my outstanding issues. This naturally doesn't take into account random stuff that might be brought to my attention in that time.

jacobheil added a commit to jacobheil/sonnetizer that referenced this issue Feb 18, 2017
From here I'm starting to parse the docs about the development of the now-deprecated NgramModel (line 315 of original). Most helpful so far are [this from Stack Exchange (6 Jul 16)](https://stackoverflow.com/questions/37504391/train-ngrammodel-in-python) and [this from issue 1342 (Mar-Oct 16)](nltk/nltk#1342).
@alvations
Contributor

@copper-head @jacobheil and NLTK users/devs who are interested in n-gram language models.

I'd just like to check in on the current state of the model submodule.

  • Do you think it's ready to be pushed out into the develop/master branch?
  • Is it still a topic that people actively pursue and want to see in NLTK?

@ezubaric
Contributor

ezubaric commented Oct 10, 2017 via email

@kindloaf

Hi - I would like to use the "old" feature of language model in NLTK. What is the latest version that still has the pre-trained language model (for English)?

@sleepyfoxen

For those finding this thread, I have kind of bodged together a submodule containing the old model code.

https://github.com/sleepyfoxen/nltk_model

@iliakur
Contributor Author

iliakur commented Aug 25, 2018

@stevenbird I think we can close this finally :)

For concrete feedback on the existing implementation, folks can open separate issues.

@stevenbird
Member

@copper-head yes I agree. Congratulations! :)


8 participants