Theory question: using word vectors for similarity generation #25
Hi,

Indeed, overfitting is a problem that I also encountered. It can actually be expected in any process of creating a dataset for an AI-related task, even though you are right that rule-based generation of datasets makes overfitting more likely (at least in my experience).

I thought about using machine learning models to improve generation as well. I could actually find a few projects which are intended to do that kind of thing: the bitext project notably discusses this here, and I think code to "expand" the number of examples can be found in this repo.

I feel like using word2vec or BERT is a theoretically good idea, but much trickier in practice, since the user would be required to have a dataset representative of the kind of sentences the final product will encounter. That dataset needs to be broad enough to be useful, yet narrow enough to cover only the domain of interest and to keep noise out of the final dataset. And even if such a dataset is available (which is generally not the case), the overfitting problem still remains; it has simply been moved to another dataset. For that reason, I am not convinced that using such a model would really help address this problem.

That being said, in some cases I think it could be helpful. If I were to design the kind of system you propose, I would rather make it replace some placeholders in the generated sentences than embed it in Chatette itself: I want to keep the DSL as deterministic as possible, so that the person who writes the templates has as much control as possible. I would be very interested in seeing such a system implemented.

About the problem of entities with high variability that are nevertheless very common (e.g. numbers, dates, weights), I don't really see a clear solution. I thought of providing predefined templates for the most common cases, but I can't define those for every language, and I want Chatette to be usable in any language (I know it has been used in quite a large variety of languages, such as French, Korean, and Mandarin).

Thank you for your suggestions and questions, they made me reflect more on the theory behind this project :)
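To make the placeholder idea concrete, here is a very rough sketch of the kind of post-processing step I have in mind, living completely outside Chatette (the `<sub:...>` marker, the vectors file, and the helper names are all hypothetical, and gensim is just one possible backend):

```python
import random
import re

from gensim.models import KeyedVectors

# Hypothetical pre-trained vectors, assumed saved earlier with KeyedVectors.save().
vectors = KeyedVectors.load("domain_vectors.kv")

# Hypothetical marker a template author could place around substitutable words.
MARKER = re.compile(r"<sub:(\w+)>")

def expand(sentence, n_variants=3, topn=5):
    """Yield variants of `sentence`, swapping each marked word for a random
    nearest neighbour in the embedding space; unmarked text stays untouched."""
    for _ in range(n_variants):
        def substitute(match):
            word = match.group(1)
            if word not in vectors:
                return word  # out-of-vocabulary: keep the original word
            neighbours = [w for w, _ in vectors.most_similar(word, topn=topn)]
            return random.choice(neighbours)
        yield MARKER.sub(substitute, sentence)

for variant in expand("I need to <sub:ship> a pallet to Melbourne"):
    print(variant)
```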
Hi,

Thanks for the swift and comprehensive response. Firstly, I had no idea bitext was looking at that exact problem and has some code for it, many thanks!

Regarding the other points: I'm definitely not suggesting moving in any significant way away from the DSL framework towards entirely autogenerated output, as that would be dangerous. Rather, I was thinking along the lines of what you mention in terms of "expansion", or in my mind a kind of slightly-more-advanced synonym substitution.

To your point about moving the overfitting from one set to another: I agree this is a risk, but I think it is limited, given the relatively low level of interaction the code would have with the final sentence (it would not be generating sentences without any other input, just modifying existing ones). For example, what I'm imagining is that, unless the language used is quite technical, a large BERT model might be a good way of performing limited word substitutions on existing sentences without breaking the context too much. The assumption here is that BERT is general enough to capture the majority of the "base language" while the existing DSL takes care of specialty areas. For example, in the DSL you might tag certain parts of the sentence as open to BERT substitution and others as fixed.

Regarding the dataset needed: if the user were to train word2vec from scratch, I would agree this is risky, because generalisation may fail, but I'm hoping the hedging outlined above would mitigate that (won't know until we try, I guess!). But I still think it's useful in the, I suspect common, case where a user has a large corpus of data they know to be a good representation of the output space, but it's too hard/expensive to label it all. In that way, this is a kind of proxy for semi-supervised learning (the supervision in this case being the DSL part). I agree it won't be a panacea for the problem of data generation for NLU, but I think that if you can add even a modest multiplier to the amount of labeled data you get from the DSL, say 1.3x, that could be useful (and the system would no doubt have a variety of tolerance parameters to play with to make sure that extra 30% was as useful as possible).

Thanks again for the package and the discussion! I very much appreciate your input, and will hopefully get some time shortly to experiment with this (famous last words...)
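To illustrate the kind of limited, tag-driven substitution I have in mind, here is a rough sketch with Hugging Face transformers (the hard-coded "open" positions stand in for whatever the DSL tags would produce, and the model choice is arbitrary):

```python
from transformers import pipeline

# A general-purpose masked language model; any BERT-style checkpoint would do.
fill = pipeline("fill-mask", model="bert-base-uncased")

def bert_substitute(tokens, open_positions, top_k=3):
    """For each whitelisted position, mask the token and let BERT propose
    in-context replacements, keeping the rest of the sentence fixed."""
    variants = []
    for pos in open_positions:
        masked = tokens.copy()
        masked[pos] = fill.tokenizer.mask_token
        for prediction in fill(" ".join(masked), top_k=top_k):
            candidate = tokens.copy()
            candidate[pos] = prediction["token_str"].strip()
            variants.append(" ".join(candidate))
    return variants

# Positions 3 ("send") and 5 ("pallet") marked as open; the rest stays fixed.
tokens = "i need to send a pallet to melbourne".split()
print(bert_substitute(tokens, open_positions=[3, 5]))
```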
Hey,

I think we are on the same page then: adding "expansion" capabilities to Chatette seems to be a good idea. Transformer-based algorithms such as BERT seem to be the best fit for that task. I think I'll need to experiment on this, possibly with the tool you referenced. (nlpaug seems very promising, but maybe not mature enough yet.)

However, I want to make Chatette a more polished product and add a few features to it before I take time to dive into this. I'll document my findings on this issue when I do, but I guess at least a few months will pass before I come back to this. Feel free to experiment on your side and keep me posted if you feel like it!

Thanks for this discussion! I appreciate the reflection you put into this.
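For reference, this is roughly the one-liner nlpaug exposes for contextual substitution today (a quick sketch; the model choice and augmentation probability are arbitrary):

```python
import nlpaug.augmenter.word as naw

# BERT-based contextual substitution; aug_p controls the fraction of words touched.
aug = naw.ContextualWordEmbsAug(
    model_path="bert-base-uncased", action="substitute", aug_p=0.15
)
print(aug.augment("I would like a quote to ship three pallets from Sydney to Auckland"))
```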
Hey,
I'm a big fan of Rasa and of NLU dataset generation platforms like this one, but in my experience (as noted) they can quickly lead to overfitting, as it can be hard to generate the true range of data you would expect from real labeled data (perhaps an unrealistic expectation).
I think, in part, the reason for this is the inability of rule-based substitutions/synonyms to really capture this variety.
Thus, I wonder if it might be useful to explore substitutions based on unsupervised embeddings. For example, rather than specifying synonyms, we could use a word2vec model to choose words based on similarity. One could even go further and use something like BERT to utilise context.
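To make the word2vec part concrete, here is a minimal sketch with gensim and its bundled GloVe vectors (the similarity threshold is an arbitrary knob, and any embedding model could stand in):

```python
import gensim.downloader as api

# Off-the-shelf general-purpose vectors; downloads (~130 MB) on first use.
vectors = api.load("glove-wiki-gigaword-100")

def similar_words(word, topn=10, min_similarity=0.6):
    """Embedding-space stand-in for a hand-written synonym list."""
    return [w for w, score in vectors.most_similar(word, topn=topn)
            if score >= min_similarity]

print(similar_words("deliver"))
```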
This might seem a bit circular, but in my mind it is akin to semi-supervised learning. The assumption would of course be that the word vectors are appropriate to the domain or application. That being said, the w2v process is unsupervised, so people who do have domain-specific data, even if unlabeled, could benefit from it.
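For that domain-specific case, training the vectors is itself cheap. A sketch, where the corpus file name and hyperparameters are placeholders:

```python
from gensim.models import Word2Vec

# Hypothetical unlabeled domain corpus, one raw sentence per line.
with open("freight_emails.txt", encoding="utf-8") as f:
    corpus = [line.lower().split() for line in f if line.strip()]

model = Word2Vec(corpus, vector_size=100, window=5, min_count=3, epochs=10)
model.wv.save("domain_vectors.kv")  # reusable by a later substitution step
print(model.wv.most_similar("pallet", topn=5))  # assuming "pallet" occurs in the corpus
```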
A couple of motivating examples. I have been looking at building an NLU system to extract intent and entities for a chatbot that gives quotes for a freight company. Part of the issue here is that some of the required entities are address components (cities, suburbs, postcodes), which can be a bit tricky even with a gazetteer. The paragraphs we would like to process are also sometimes quite long. I have found that intent classification was quite straightforward, but the NER was harder (it also needs to extract dimensions such as length, width, height, and weight). The variety in the observed data is significant. For example:
These are all quite similar, but I think (maybe naively, and if so please do prove me wrong) it would be quite hard to write good DSL rules to generate sentences like these.
Similarly, it would be very interesting if one could actually 'train' the DSL rules based on an input dataset, again using word vectors.
Apologies for the long post and if this is the wrong forum for this. I think these tools are crucial for NLU and I'm just looking for ways to extend their applicability.