Discussion: Resurrecting the Ngram Model #1342
It would be nice to have a working n-gram library in NLTK. SRILM has some Python wrappers for inference, but it has a restrictive license. KenLM has a Python wrapper for doing inference, but it has compilation dependencies. Neither supports estimation. So currently there are no well-tested n-gram tools available for Python NLP.
@anttttti Thanks for the feedback, I feel very motivated to submit a patch seeing all this demand for the feature :) Do you happen to have any thoughts about the specific issues I posted?
The advanced smoothing methods are simple to implement once you understand that they differ only in how the discounting and interpolation are defined. Earlier papers and much of the textbook coverage make the models seem more complicated than they are, since the connections weren't understood that well at the time. There shouldn't be a need for separate modules, just configuration of the smoothing. The older backoff models that were not correctly normalized are not used these days; see Joshua Goodman's "A Bit of Progress in Language Modeling" for a great summary. Page 63 of https://arxiv.org/pdf/1602.02332.pdf summarizes some choices of discounting and interpolation for the unigram case; higher-order models use the same recursively. Kneser-Ney is a bit more tricky with the modified backoffs. Smoothing is not that critical for most uses: with enough data even optimized Kneser-Ney isn't better than Stupid Backoff. So just having high-order n-grams available in Python with any basic smoothing would be nice. Lidstone or Jelinek-Mercer for each order should work perfectly fine.
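To make the "just configuration of the smoothing" point concrete, here is a minimal sketch of recursive Jelinek-Mercer interpolation: each order's maximum-likelihood estimate is mixed with the next-lower order, down to the unigram case. The `counts` layout (a dict from n-gram order to a `Counter` of tuples) and the mixing weight `lam` are assumptions for illustration, not any NLTK API.

```python
from collections import Counter

def interp_prob(word, context, counts, lam=0.7):
    """Jelinek-Mercer interpolation: mix the ML estimate at each order
    with the next-lower order, recursing down to the unigram estimate.
    `counts[k]` maps k-gram tuples to frequencies (hypothetical layout)."""
    order = len(context) + 1
    if order == 1:
        total = sum(counts[1].values())
        return counts[1][(word,)] / total if total else 0.0
    hist = counts[order - 1][context]            # frequency of the context
    ml = counts[order][context + (word,)] / hist if hist else 0.0
    return lam * ml + (1 - lam) * interp_prob(word, context[1:], counts, lam)

# toy corpus for a bigram model
tokens = "a b a b a c".split()
counts = {1: Counter((t,) for t in tokens),
          2: Counter(zip(tokens, tokens[1:]))}
print(interp_prob("b", ("a",), counts))
```

Because every order is mixed in unconditionally, the estimates stay properly normalized, which is exactly what the older unnormalized backoff schemes got wrong.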
Issue 1) One thing that I think would be very useful is to have a utility for building a vocabulary and censoring OOV tokens. That would correct many of the silly errors that frustrated users of the old versions. I am attaching some code that does that (feel free to use or copy).

Issue 2a) I think that it's still useful to have Kneser-Ney; it's commonly taught and it's useful to have a reference implementation.
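In the spirit of the vocabulary utility mentioned above (the attached code is not reproduced here), a minimal sketch of vocabulary building with a count cutoff plus OOV censoring; the names `build_vocab`, `censor`, and the `<UNK>` symbol are illustrative assumptions:

```python
from collections import Counter

UNK = "<UNK>"

def build_vocab(tokens, cutoff=2):
    """Keep tokens seen at least `cutoff` times; everything else is OOV."""
    freqs = Counter(tokens)
    return {tok for tok, n in freqs.items() if n >= cutoff}

def censor(tokens, vocab):
    """Replace out-of-vocabulary tokens with a single UNK symbol, so the
    model's event space is fixed before any counting happens."""
    return [tok if tok in vocab else UNK for tok in tokens]

train = "the cat sat on the mat the cat".split()
vocab = build_vocab(train, cutoff=2)
print(censor("the dog sat".split(), vocab))  # → ['the', '<UNK>', '<UNK>']
```

Censoring the training data through the same vocabulary gives `<UNK>` its own counts, which is what lets the model assign sensible probabilities to unseen words later.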
@anttttti "The advanced smoothing methods are simple to implement once you understand that they only differ in how the discounting and interpolation are defined" @ezubaric "Issue 2b) I worry that coupling ProbDist makes this far more complicated than it needs to be" Though I haven't looked at this code in a while, my sense is that both of these statements are true. If I recall correctly, ConditionalProbDist (and more generally ProbDist) are normalized too early for use in smoothed ngram models. E.g., while we know how likely a word is in a given context, we have a hard time reasoning about the contexts themselves (I believe an earlier patch attempted to correct this issue -- despite best efforts, it was a bit kludgy [https://github.com//pull/800]). IMHO, the whole thing is slightly over engineered. |
Amen to that! I've been trying to make this work forever now (I submitted #800 and yeah, it wasn't elegant at all) and I'm also starting to think there are just too many moving parts for it to be reasonable. @ezubaric thanks a bunch for that file, I'm totally borrowing its spirit for the refactoring.

Based on all this feedback, here's my new take on the module structure. We have just one class: `NgramModelCounter`. Crucially, this class does not deal with probabilities at all. That should make things significantly simpler and at the same time more flexible. All anyone needs to do to add probabilities is use their favorite OOP method (e.g. inheritance or composition) to write a class that uses the counter's attributes to construct its own probability estimates.

If I have time I'll submit one (or two!) examples of how adding probabilities to NgramModelCounter could work. What do you folks think?
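A hypothetical sketch of what such a probability-free counting class might look like; the class name matches the one mentioned above, but the layout and method names are assumptions, not the eventual NLTK design:

```python
from collections import Counter, defaultdict

class NgramModelCounter:
    """Counts 1..n grams in a coordinated way; knows nothing about
    probabilities (hypothetical sketch, not the NLTK API)."""

    def __init__(self, order):
        self.order = order
        self.counts = defaultdict(Counter)   # ngram length -> Counter of tuples

    def train(self, tokens):
        tokens = list(tokens)
        for n in range(1, self.order + 1):
            for i in range(len(tokens) - n + 1):
                self.counts[n][tuple(tokens[i:i + n])] += 1

    def count(self, ngram):
        return self.counts[len(ngram)][tuple(ngram)]

counter = NgramModelCounter(order=2)
counter.train("a b a".split())
print(counter.count(("a", "b")))  # → 1
```

A smoothing class could then hold (or inherit from) such a counter and derive whatever probability estimates it likes from `counts`.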
@copper-head having an interface as similar to KenLM's as possible would be good for future integration, once a stable version of the module is out: https://github.com/kpu/kenlm/blob/master/python/example.py

This function would help with the padding too: https://github.com/nltk/nltk/blob/develop/nltk/util.py#L381
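The padding helper being linked isn't reproduced here, but the idea is easy to sketch in plain Python; the function name and the `<s>`/`</s>` pad symbols are assumptions:

```python
def padded_ngrams(tokens, n, left="<s>", right="</s>"):
    """Yield n-grams over a token sequence padded with (n-1) boundary
    symbols on each side, so edge words get full-length contexts."""
    seq = [left] * (n - 1) + list(tokens) + [right] * (n - 1)
    for i in range(len(seq) - n + 1):
        yield tuple(seq[i:i + n])

print(list(padded_ngrams("a b".split(), 2)))
# → [('<s>', 'a'), ('a', 'b'), ('b', '</s>')]
```

Without the padding, the first and last words of every sentence would never be counted with a full (n-1)-length context.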
I think what @copper-head is suggesting is a class that counts unigrams, bigrams, trigrams, etc. in a coordinated way that is convenient to consume by downstream language models. In that case, I think the kenlm API does not apply yet. (I may be wrong, but from the example posted, it doesn't look like the kenlm API deals in raw frequency counts.)

I think it is also worthwhile considering a minimal language model API that consumes those ngram counts. As @copper-head suggests, this would be a subclass, or better yet, a completely separate interface (allowing for vastly different implementations like https://www.projectoxford.ai/weblm). Here, I think it may be reasonable to adopt the kenlm API, but think any ngram LM interface ought to be simple enough that adapters can be easily written. I think a minimal ngram API really only needs methods to (1) compute the conditional probability of a token given a context or sequence, and (2) report on the size and makeup of the known vocabulary. Most everything else can be computed via helper methods, including computations of joint probability, as well as language generation. These helpers may or may not be part of the interface.
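A toy sketch of the minimal two-method interface described above, with joint probability derived as a helper; all names (`LanguageModelI`, `UniformLM`, `logprob_sequence`) are hypothetical:

```python
from abc import ABC, abstractmethod
import math

class LanguageModelI(ABC):
    """Deliberately tiny LM interface: conditional probability plus
    vocabulary introspection; everything else is derivable (sketch)."""

    @abstractmethod
    def prob(self, word, context): ...

    @abstractmethod
    def vocab(self): ...

    def logprob_sequence(self, tokens, order=2):
        """Helper derived from prob(): joint log-probability of a sequence."""
        total = 0.0
        for i, w in enumerate(tokens):
            context = tuple(tokens[max(0, i - order + 1):i])
            total += math.log(self.prob(w, context))
        return total

class UniformLM(LanguageModelI):
    """Trivial adapter showing how little an implementation must provide."""
    def __init__(self, words): self._vocab = set(words)
    def prob(self, word, context): return 1 / len(self._vocab)
    def vocab(self): return self._vocab

lm = UniformLM({"a", "b", "c", "d"})
print(lm.prob("a", ()))  # → 0.25
```

An adapter for an external engine such as KenLM would then only need to implement `prob` and `vocab`.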
Hmm, interesting point. I wonder though if keeping track of those counts for G-T might slow the training down a bit, and unnecessarily so for folks who don't want to use that particular smoothing. I think it might make more sense to do the minimum in the basic counter class and leave that kind of bookkeeping to the smoothing-specific subclasses.
Sorry, it looks like I accidentally deleted a post. To fill in the missing context for future readers: I think it would be good to consider common smoothing techniques when designing the NgramModelCounter API. For example, allowing users to query the number of species observed once, twice, or N times is important for Good-Turing smoothing (as well as Witten-Bell smoothing, etc.)

Edit: It looks like the FreqDist class already has some of this (see: FreqDist.hapaxes and FreqDist.r_Nr). I wonder if it can be re-purposed? Or if FreqDist should be the starting point.
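For reference, the counts-of-counts table (Nr: how many species were observed exactly r times) that Good-Turing needs is easy to build from any frequency map; this sketch uses a plain `Counter` rather than NLTK's `FreqDist`, with hypothetical helper names:

```python
from collections import Counter

def counts_of_counts(freqs):
    """Nr table: number of species observed exactly r times.
    This is the quantity Good-Turing (and Witten-Bell) smoothing needs."""
    return Counter(freqs.values())

freqs = Counter("a b a b a c d".split())   # a:3, b:2, c:1, d:1
nr = counts_of_counts(freqs)
print(nr[1])  # → 2   (hapaxes: c and d)
```

Since the table is derived in one pass from the final frequencies, a counter class would not need to maintain it during training; a Good-Turing subclass could build it lazily on demand.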
I like the idea of just having a counts object which can then be queried with subclasses that implement specific smoothing methods. My only concern is that training will have issues if we don't have the ability to fix the vocabulary early: it won't be consistent with standard LM training processes, and tracking all vocabulary would cause the memory to blow up (which was a huge problem with the old LM too).

Noted. I have ideas for how to address this. I'll be posting a PR later today.
PR #1351 is up!! Bring on the questions/nitpicks :)

@copper-head – how far are we away from being able to merge this back into the main branch?

Looking at my to-do list, I'd say I need 2-3 days of focused work.
From here I'm starting to parse the docs about the development of the now-deprecated NgramModel (line 315 of the original). Most helpful so far are [this from Stack Exchange (6 JUL 16)](http://stackoverflow.com/questions/37504391/train-ngrammodel-in-python) and [this from issue 1342 (MAR-OCT 16)](https://github.com/nltk/nltk/issues/1342).
@copper-head @jacobheil and NLTK users/devs who are interested in N-gram language models. Just checking in on the current state of the model submodule.

- Do you think it's ready to push out into the develop/master branch?
- Is it still a topic that people actively pursue and want to see in NLTK?

I think it's definitely worth having in NLTK; it's a core part of how I teach NLP.
Is NLTK supporting deep LMs now? Is this API compatible with that?

-- Jordan Boyd-Graber
Hi - I would like to use the "old" language model feature in NLTK. What is the latest version that still has the pre-trained language model (for English)?

For those finding this thread, I have kind of bodged together a submodule containing the old model code.

@stevenbird I think we can close this finally :) For concrete feedback on the existing implementation, folks can open separate issues.

@copper-head yes I agree. Congratulations! :)
Hi folks!
I'm working on making sure the Ngram Model module could be added back into NLTK and would like to bring up a couple of issues for discussion.
Issue 1
Here @afourney said it would be nice to add interpolation as an alternative to the default Katz backoff as a way of handling unseen ngrams. I've been thinking about this and I might have an idea how this could work. I'd like to run it by all interested parties.
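To make the backoff-vs-interpolation distinction concrete: interpolation always mixes estimates from every order, while backoff falls back to a lower order only when the higher order has zero evidence. Here is a toy backoff scorer in the Stupid-Backoff style rather than full Katz (which additionally requires estimating discounts); the `counts` layout and `alpha` weight are assumptions for illustration:

```python
from collections import Counter

def backoff_score(word, context, counts, alpha=0.4):
    """Stupid-Backoff-style score: use the highest-order estimate that
    has evidence, multiplying by alpha at each fallback. Scores are NOT
    normalized probabilities; proper renormalization is what makes
    Katz backoff harder to implement."""
    ngram = context + (word,)
    if context and counts[len(ngram)][ngram]:
        return counts[len(ngram)][ngram] / counts[len(context)][context]
    if context:
        return alpha * backoff_score(word, context[1:], counts, alpha)
    total = sum(counts[1].values())
    return counts[1][(word,)] / total if total else 0.0

tokens = "a b a b".split()
counts = {1: Counter((t,) for t in tokens),
          2: Counter(zip(tokens, tokens[1:]))}
print(backoff_score("b", ("a",), counts))  # seen bigram: pure ML estimate
```

An interpolated model would instead blend the bigram and unigram estimates even when the bigram has been seen, which is what makes the two families behave differently on frequent ngrams.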
The current class structure of the `model` module is as follows:

- `model.api.ModelI` -> this is supposed to be an abstract class or an interface, I guess.
- `model.ngram.NgramModel` -> extends the above class; contains the current implementation of the ngram model.

Here's what I propose:

- `model.api.Model` -> I'm honestly not sure I see the point of this; ambivalent on whether to keep it or ditch it.
- `model.ngram.BasicNgramModel` -> This is the same as `NgramModel`, minus everything that has to do with backoff. Basically, it can't handle ngrams unseen in training. "Why have this?" you might ask. I think this would be a great demo of the need for backoff/interpolation. Students can simply try it out and see how badly it performs to convince themselves to use the other classes.
- `model.ngram.BackoffNgramModel` -> Inherits from `BasicNgramModel` to yield the current implementation of `NgramModel`, except that it's more explicit about the backoff part.
- `model.ngram.InterpolatedNgramModel` -> Also inherits from `BasicNgramModel`, but uses interpolation instead of backoff.

The long-term goals here are:

a) to allow any `ProbDist` class to be used as a probability estimator, since interpolation/backoff are (mostly) agnostic of the smoothing algorithm being used;
b) to allow anyone who wants to optimize an NgramModel for their own purposes to easily inherit some useful defaults from the classes in NLTK.
Issue 2

Unfortunately the probability module has its own problems (e.g. #602, and (my) Kneser-Ney implementation is wonky). So for now I'm only testing correctness with `LidstoneProbDist`, since it is easy to compute by hand. Should I be worried about the lack of support for the more advanced smoothing methods? Or do we maybe want to proceed this way to ensure at least that the Ngram Model works, and then tackle the problematic probability classes separately?

Python 3 `super()`

When calling `super()`, do I need to worry about supporting Python 2? See this for context.