Some scripts for training a neural network language model on tweets.
TwitNet is a collection of Python scripts that make it easy to create RNN language models, train them on tweet data, and sample new tweets from them. The scripts handle basic preprocessing, creating models of different sizes and architectures, training these models, and sampling from them. TwitNet is built on top of the Keras library for neural networks in Python, which in turn uses either TensorFlow or Theano as backend for tensor operations.
TwitNet was developed mainly with word based language models in mind: the preprocessing replaces links and Twitter usernames with special tokens so that these words don't clutter up the vocabulary. Character based language models are also supported, with more basic preprocessing.
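To illustrate the idea (this is a hypothetical sketch, not the code used in preprocess.py), such a replacement can be done with a few regular expressions:

```python
import re

# Hypothetical sketch of the token replacement idea; the actual rules in
# preprocess.py may differ in detail.
LINK_RE = re.compile(r"https?://\S+")
USER_RE = re.compile(r"@\w+")
NUMBER_RE = re.compile(r"\b\d+(\.\d+)?\b")

def normalize_tweet(text):
    """Replace concrete links, usernames and numbers with placeholder tokens."""
    text = LINK_RE.sub("<link>", text)   # replace links first so URL digits are not touched
    text = USER_RE.sub("<user>", text)
    text = NUMBER_RE.sub("<number>", text)
    return text

print(normalize_tweet("Thanks @realDonaldTrump, 1000 retweets! https://t.co/abc123"))
# -> "Thanks <user>, <number> retweets! <link>"
```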
I have trained some networks on the corpus of all of Donald Trump's tweets up until about November 2016. The tweets were obtained from this site. There are two files with model parameters in the pretrained directory:
- trump_word.mod is a word based language model with a vocabulary of 3000 words. The network is a 3 layer LSTM with 256 neurons each. To sample from this model, run
python3 preprocess.py -v 3000
python3 sample.py -m pretrained/trump_word.mod
- trump_char.mod is a character based language model (3 layer LSTM with 512 units each). To sample from this model, run
python3 preprocess_char.py
python3 sample_char.py -m pretrained/trump_char.mod
Here are some of my favourite outputs from the character based model (the part in brackets is what was fed to the network as the beginning of the tweet):
- [The united nations] yesterday poll. God please do this, losers.
- [Hillary Clinton] is biggest breakdown w/ all the disgracees...." No, thanks!
- [Americans should vote for] Mr. Trump, that's working and bring more #Trump2016" Great
- [The mexican border] is strong and strong and surprised.)--Trump 2016"
- [We have to make] things cutting on television. The Pennsylvania Bush can do it.
- [Watch the interview] with Governor Pastor. Go to Jeb. ...@realDonaldTrump https://t.co/SyzcrCyG
Quite to my surprise, the character based network, which was trained only on Donald Trump's tweets, is able to produce English text with only a few spelling or grammar mistakes. It also learns to occasionally use abbreviations, hashtags, and even to create random looking URL shortener links! Punctuation is somewhat of a problem: while dots, exclamation marks and commas are placed neatly, quotes and parentheses seem to be more difficult. This might be because of the limited backpropagation through time (in this case 40 characters), which prevents the network from learning very long-term dependencies.
And here are some interesting tweets generated by the word based model:
- [the united nations] people who amazing @@realDonaldTrump will be on all of our country at my for cause video , and we just will win!
- [hillary clinton] shots yourself solutions announces https://t.co/owXU5vwh
- [americans should vote for] the first a great people. win -- - beginning & & make our country great for your country failed of the deal.
- [the mexican border] economy or charity should wouldn't the wall. they don't have a business and fight how bad written got not from china.
- [we have to make] america great again! make america great again!
- [watch the interview] on careful discussing the trump st -- -- the all person joke was on. an many shot leadership for the keep ago.
As you can see, the sampled tweets are much less coherent than those of the character based model, and the grammar is quite far off. In general, my experiments so far have had the character based models outperforming the word based ones. Even with a relatively small vocabulary of 3000 words and the tweet preprocessing steps (such as ignoring case, replacing links etc.), there are still many words that occur only very infrequently in the training data - a problem which cannot occur in the character based approach, which therefore seems to be superior for this specific task.
- Install the necessary dependencies: Numpy, TensorFlow and Keras. See Installation for details.
- Gather a corpus of tweets and store them as data/tweets.txt in the format of one tweet per line.
- To create and train a word based language model with the default parameters (vocabulary size of 4000 and 2 layer LSTM with 128 units each), run
python3 preprocess.py
python3 create_model.py
python3 train_model.py
This will train the model for 50 epochs, evaluate the loss on 10% of the training data held out as validation data, and save the model parameters after every epoch.
- You can then sample tweets from the model checkpoint with the lowest validation loss by running
python3 sample.py -m model/model_epoch<n>.mod
where <n> is the number of the epoch after which the validation loss was minimal.
- To train a character based model, run python3 preprocess_char.py instead of python3 preprocess.py and python3 sample_char.py instead of python3 sample.py.
- Install Python 3.
- Install Numpy.
- Install TensorFlow.
- Install Keras (pip3 install keras).
- (Optional, but recommended!) Install the latest (developer) version of Theano and set up Keras to use Theano as its backend. This will greatly decrease training time, as Theano currently has better performance when training recurrent neural networks. If you have a GPU, I would also recommend installing CUDA and setting up Theano to work with CUDA.
- Clone this repository. You should now be able to run the scripts.
preprocess.py preprocesses a file of tweets, given as a text file with one tweet per line. The script splits the tweets into words and (if not turned off by the corresponding flags) performs some further preprocessing steps:
- Every link is replaced by the <link> token, every number by the <number> token, and every twitter username by the <user> token. Hashtags are not replaced.
The idea behind this is that hashtags might add some significant meaning to the tweets, while concrete links, numbers or usernames are less important. This is of course debatable, and this preprocessing step can be turned off - in this case it might be necessary to also increase the vocabulary size.
The replaced links, names and numbers are stored in an additional tweet data file tweet_data.pickle in the same directory as the input tweet file. Tokens appearing multiple times will also be stored with their multiplicity. During sampling, the tokens are replaced by samples chosen uniformly at random from this stored data.
- The vocabulary is limited to the most frequent words, where the vocabulary size is given as command line argument (default: 4000). Every word not in the vocabulary is replaced by the <unknown> token. The vocabulary is stored as vocab.txt in the same directory as the training data.
- Very short tweets with less than the given minimum length are removed.
- Words are mapped to indices and the tweets are stored as training sequences for the language model. The training data is stored as training_data.npz in the same directory as the tweet input data.
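The vocabulary limiting and index mapping steps could look roughly like the following sketch (hypothetical helper functions; preprocess.py may differ in token names and storage format):

```python
from collections import Counter

# Hypothetical sketch of vocabulary limiting and index mapping; preprocess.py
# may use different token names and file formats.
def build_vocab(tokenized_tweets, vocab_size=4000):
    counts = Counter(word for tweet in tokenized_tweets for word in tweet)
    vocab = ["<unknown>"] + [w for w, _ in counts.most_common(vocab_size - 1)]
    return {word: index for index, word in enumerate(vocab)}

def to_indices(tokenized_tweets, word_to_index):
    unknown = word_to_index["<unknown>"]
    return [[word_to_index.get(word, unknown) for word in tweet]
            for tweet in tokenized_tweets]

tweets = [["make", "america", "great", "again", "<link>"],
          ["thank", "you", "<user>"]]
word_to_index = build_vocab(tweets, vocab_size=6)
print(to_indices(tweets, word_to_index))
```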
Short name | Long name | Argument type | Default value | Description |
---|---|---|---|---|
-i | --input_file | String | "data/tweets.txt" | Path to tweet input file. |
-v | --vocab_size | Integer | 4000 | Size of the vocabulary. |
-m | --min_length | Integer | 3 | Minimum word length of a tweet. |
-c | --case_sensitive | Flag | False | If set, treat words as case-sensitive. |
-u | --tokens_unchanged | Flag | False | If set, do not replace individual links, usernames and numbers. |
-h | --help | Flag | False | Print help text. |
preprocess_char.py preprocesses a file of tweets in the same input format as preprocess.py, but creates training data for a character based language model. For this, the whole tweet corpus is simply concatenated and split into fixed length character sequences. No replacement of links etc. is performed, and the vocabulary is not limited, since it typically stays small (fewer than 100 distinct characters).
An important parameter is the history length, which determines the length of training sequences and thus the maximum number of backpropagation through time steps. The default history length is 40 characters.
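Conceptually, the training data is laid out as in the following sketch (assumed helper code; the exact splitting and storage format of preprocess_char.py may differ):

```python
# Hypothetical sketch of the character-level training data layout; the exact
# splitting and storage format in preprocess_char.py may differ.
def make_char_sequences(tweets, history_length=40):
    corpus = "\n".join(tweets)
    chars = sorted(set(corpus))
    char_to_index = {c: i for i, c in enumerate(chars)}
    inputs, targets = [], []
    # Each training example is a fixed length window of characters together
    # with the character that follows it.
    for start in range(0, len(corpus) - history_length, history_length):
        window = corpus[start:start + history_length]
        inputs.append([char_to_index[c] for c in window])
        targets.append(char_to_index[corpus[start + history_length]])
    return inputs, targets, char_to_index
```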
Short name | Long name | Argument type | Default value | Description |
---|---|---|---|---|
-i | --input_file | String | "data/tweets.txt" | Path to tweet input file. |
-l | --history_length | Integer | 40 | Maximum length of char history used for backpropagation through time. |
-h | --help | Flag | False | Print help text. |
create_model.py creates a Keras recurrent neural network model with the parameters given on the command line. The structure of the network is:
- One initial embedding layer from the vocabulary size to the hidden size. This layer is omitted if the vocabulary size is not at least five times as large as the hidden size, which is typically the case for character based models.
- A specified number of recurrent layers (default: 2) with a specified number of hidden units per layer (default: 128). The default architecture is LSTM, but plain RNNs and GRUs are also available.
- Optionally, dropout is applied after each recurrent layer with the given retention probability.
- A final dense layer from the hidden size to the vocabulary size.
Either Adam or RMSProp can be chosen as the optimizer, and the initial learning rate can be specified. The model can also be optimized for training on a GPU. The model is compiled and stored at the specified output location.
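A comparable model can be put together in Keras roughly as follows (a minimal sketch assuming a word based model with a softmax output; create_model.py may configure, compile and save the model differently):

```python
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dropout, Dense, Activation
from keras.optimizers import Adam

# Hypothetical sketch of a comparable architecture; create_model.py may differ
# in layer configuration and in how the model is compiled and saved.
def build_word_model(vocab_size=4000, hidden_size=128, num_layers=2, retention=0.0):
    model = Sequential()
    # Embedding from vocabulary indices to dense vectors of the hidden size
    # (only useful when the vocabulary is much larger than the hidden size).
    model.add(Embedding(input_dim=vocab_size, output_dim=hidden_size))
    for layer in range(num_layers):
        # All but the last recurrent layer return full sequences so that the
        # next recurrent layer receives one vector per time step.
        model.add(LSTM(hidden_size, return_sequences=(layer < num_layers - 1)))
        if 0.0 < retention < 1.0:
            model.add(Dropout(1.0 - retention))  # Keras Dropout takes a drop rate
    model.add(Dense(vocab_size))
    model.add(Activation("softmax"))
    model.compile(loss="sparse_categorical_crossentropy", optimizer=Adam(lr=0.001))
    return model
```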
Short name | Long name | Argument type | Default value | Description |
---|---|---|---|---|
-o | --output_file | String | "model/model.mod" | Path to output file for model. |
-v | --vocab_file | String | "data/vocab.txt" | Path to vocabulary file. |
-t | --layer_type | "rnn"/"gru"/"lstm" | "lstm" | Type of recurrent layer to use. |
-n | --hidden_num | Integer | 2 | Number of hidden layers. |
-s | --hidden_size | Integer | 128 | Number of neurons per hidden layer. |
-a | --optimizer | "adam"/"rmsprop" | "adam" | Optimizer to use. |
-l | --learning_rate | Float | 0.001 | Initial learning rate for the optimizer. |
-d | --dropout | Float | 0.0 | If set to 0 < p < 1: apply dropout with retention probability p after each recurrent layer. |
-g | --gpu | Flag | False | Optimize network for GPU training. |
-h | --help | Flag | False | Print help text. |
train_model.py trains a given model for a specified number of epochs. For this, the training data is split into training and validation data, and the loss is evaluated regularly on the validation data. The model parameters are also saved regularly. If desired, the model can be trained on the complete training data, with the potential danger of overfitting - the loss will then be evaluated on a part of the training data. The learning rate is adjusted automatically once the loss stops decreasing.
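In Keras terms, this training step can be sketched roughly as follows (the array names x and y inside the .npz file and the callback settings are assumptions; train_model.py may handle saving, evaluation and learning rate adjustment differently):

```python
import numpy as np
from keras.models import load_model
from keras.callbacks import ModelCheckpoint, ReduceLROnPlateau

# Hypothetical sketch of the training loop; train_model.py may schedule saving,
# evaluation and learning rate adjustment differently.
model = load_model("model/model.mod")
data = np.load("data/training_data.npz")
x, y = data["x"], data["y"]  # assumed array names inside the .npz file

callbacks = [
    # Save the parameters after every epoch, tagged with the epoch number.
    ModelCheckpoint("model/model_epoch{epoch:d}.mod"),
    # Lower the learning rate once the validation loss stops decreasing.
    ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=2),
]
model.fit(x, y, batch_size=32, epochs=50, validation_split=0.1,
          callbacks=callbacks)
```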
Short name | Long name | Argument type | Default value | Description |
---|---|---|---|---|
-m | --model_file | String | "model/model.mod" | Path to model file. |
-v | --vocab_file | String | "data/vocab.txt" | Path to vocabulary file. |
-t | --training_file | String | "data/training_data.npz" | Path to training data file. |
-e | --epochs | Integer | 50 | Number of epochs to train. |
-b | --batchsize | Integer | 32 | Minibatch size for training. |
-s | --save_every | Integer | 1 | Save model parameters after every n epochs. |
-l | --evaluate_every | Integer | 1 | Evaluate loss after every n epochs. |
-p | --validation_split | Float | 0.1 | Part of training data used as validation data for evaluating loss. |
-a | --train_on_all | Flag | False | If set, train on all examples, including the validation samples. Might lead to overfitting. |
-h | --help | Flag | False | Print help text. |
sample.py samples tweets from a trained word based language model. The user can supply an initial sequence to the model, which is then completed into a number of tweets of the specified maximum length.
If special tokens for links etc. were created during preprocessing, they are replaced with values sampled uniformly at random from the stored tweet data. Unless enabled by the corresponding flag, no <unknown> tokens will be sampled. One can also experiment with the temperature argument for sampling, where a temperature < 1 leads to less random output. Note that it will initially take several seconds to load the model parameters.
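The temperature parameter can be understood from the following sketch of sampling a single word index (hypothetical helper, not necessarily how sample.py implements it):

```python
import numpy as np

# Hypothetical sketch of temperature sampling from the network's output
# distribution; sample.py may implement this differently.
def sample_index(probabilities, temperature=1.0):
    """Draw an index from the softmax output, rescaled by the temperature."""
    logits = np.log(np.asarray(probabilities, dtype=np.float64) + 1e-8) / temperature
    rescaled = np.exp(logits)
    rescaled /= np.sum(rescaled)
    # Lower temperatures sharpen the distribution, giving less random output.
    return np.random.choice(len(rescaled), p=rescaled)
```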
Short name | Long name | Argument type | Default value | Description |
---|---|---|---|---|
-m | --model_file | String | "model/model.mod" | Path to model file. |
-v | --vocab_file | String | "data/vocab.txt" | Path to vocabulary file. |
-d | --data_file | String | "data/tweet_data.pickle" | Path to tweet data file. |
-n | --samples_number | Integer | 3 | Number of samples to create for each user input. |
-l | --max_length | Integer | 32 | Maximum number of words in a sampled tweet. |
-t | --temperature | Float | 1.0 | Temperature for sampling from the network's output distribution. |
-u | --sample_unknown | Flag | False | If set, allow sampling the <unknown> token in tweets. |
-h | --help | Flag | False | Print help text. |
sample_char.py works like sample.py, except that tweets are sampled from a character based language model.
Short name | Long name | Argument type | Default value | Description |
---|---|---|---|---|
-m | --model_file | String | "model/model.mod" | Path to model file. |
-v | --vocab_file | String | "data/vocab.txt" | Path to vocabulary file. |
-n | --samples_number | Integer | 3 | Number of samples to create for each user input. |
-l | --max_length | Integer | 140 | Maximum number of characters in a sampled tweet. |
-t | --temperature | Float | 1.0 | Temperature for sampling from the network's output distribution. |
-h | --help | Flag | False | Print help text. |