This is my implementation of the n-gram language model, as described by Andrej Karpathy in his makemore series.
An n-gram language model works by assigning a probability to each n-gram encountered in the training set and predicting the next letter in a sequence based on those probabilities; a bigram model is the special case of an n-gram model with n = 2. The bigram implementation proceeds as follows:
- Create a string-to-integer tokenizer (a mapping from each character to an integer index) and the reverse, integer-to-string mapping;
- Put the counts of all bigrams encountered in the data set into a tensor (counts_tensor). The position of each count is determined by the tokenizer's mapping: every row and column corresponds to a letter (or the special '.' token), and the entry at row i, column j is the number of times symbol j follows symbol i;
- Normalize the tensor so that each row becomes a probability distribution over the next character, dividing every entry by its row's total count (see the first sketch after this list);
- Iteratively sample from the tensor: first sample from the row of the starting token, then switch to the row indexed by the most recently sampled letter (e.g., if the letter 'a' is sampled, move to the row labeled 'a' and sample from it), and repeat (second sketch below);
- Output the sampled characters, which should resemble real names;
- Calculate the average negative log likelihood over all bigrams in the data set to estimate the quality of the model: the lower the negative log likelihood, the better the model (third sketch below).
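
Below is a minimal sketch of the counting and normalization steps, assuming a `names.txt` file with one name per line (as in the makemore dataset); the variable names are illustrative rather than the exact ones used in this repo.

```python
import torch

words = open('names.txt', 'r').read().splitlines()

# String-to-integer tokenizer (stoi) and its reverse mapping (itos);
# index 0 is reserved for the special '.' start/end token.
chars = sorted(set(''.join(words)))
stoi = {s: i + 1 for i, s in enumerate(chars)}
stoi['.'] = 0
itos = {i: s for s, i in stoi.items()}

# Count every bigram in the data set into a (vocab x vocab) tensor.
counts_tensor = torch.zeros((len(stoi), len(stoi)), dtype=torch.int32)
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        counts_tensor[stoi[ch1], stoi[ch2]] += 1

# Normalize row by row: row i becomes P(next character | current character i).
# (Adding 1 before normalizing is a common smoothing trick, not required by the
# steps above; it avoids zero probabilities and infinite negative log likelihoods.)
P = (counts_tensor + 1).float()
P /= P.sum(dim=1, keepdim=True)
```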
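A sketch of the sampling loop, continuing from `P` and `itos` defined above:

```python
# Deterministic generator so the samples are reproducible.
g = torch.Generator().manual_seed(2147483647)

for _ in range(5):                  # generate five name-like strings
    out = []
    ix = 0                          # start from the '.' (start token) row
    while True:
        p = P[ix]                   # distribution over the next character
        ix = torch.multinomial(p, num_samples=1, generator=g).item()
        if ix == 0:                 # '.' sampled again: end of the name
            break
        out.append(itos[ix])
    print(''.join(out))
```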
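And a sketch of the evaluation step, computing the average negative log likelihood of the data under the model (again reusing `words`, `stoi`, and `P` from the first sketch):

```python
log_likelihood = 0.0
n = 0
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2 in zip(chs, chs[1:]):
        log_likelihood += torch.log(P[stoi[ch1], stoi[ch2]])
        n += 1

nll = -log_likelihood / n           # lower is better
print(f'average negative log likelihood: {nll.item():.4f}')
```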
The idea of a trigram model is much like that of the bigram one, the difference being that it takes two letters and predicts the third using the same sampling technique. In combination with an NN-based lookup table, the model achieves a much lower loss and generates considerably more convincing names than the bigram model.
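As a rough illustration of the NN-based lookup table in the trigram case, here is a hedged sketch: the two context characters are one-hot encoded, concatenated, and multiplied by a single weight matrix trained with gradient descent on the cross-entropy loss (which equals the average negative log likelihood). The exact architecture and hyperparameters below are assumptions, not necessarily what this repo uses; `words` and `stoi` come from the first sketch.

```python
import torch
import torch.nn.functional as F

# Build trigram training pairs: (two context characters) -> next character.
xs1, xs2, ys = [], [], []
for w in words:
    chs = ['.'] + list(w) + ['.']
    for ch1, ch2, ch3 in zip(chs, chs[1:], chs[2:]):
        xs1.append(stoi[ch1])
        xs2.append(stoi[ch2])
        ys.append(stoi[ch3])
xs1, xs2, ys = torch.tensor(xs1), torch.tensor(xs2), torch.tensor(ys)

V = len(stoi)                       # vocabulary size (27 for the names data)
g = torch.Generator().manual_seed(2147483647)
W = torch.randn((2 * V, V), generator=g, requires_grad=True)  # the lookup table

for step in range(200):
    # Forward pass: concatenated one-hot context -> logits over the next char.
    xenc = torch.cat([F.one_hot(xs1, V), F.one_hot(xs2, V)], dim=1).float()
    logits = xenc @ W
    loss = F.cross_entropy(logits, ys)      # average negative log likelihood

    # Backward pass and a plain gradient-descent update.
    W.grad = None
    loss.backward()
    W.data += -50 * W.grad

print(f'final loss: {loss.item():.4f}')
```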
#ToDo Further goals include experimenting with splitting the data, adding more layers to the network, and hacking on the code in arbitrary ways.