Predict the Next Word with Python
Keyboards are part of our everyday life: we use them in every computing environment. To reduce typing effort, most keyboards today offer advanced prediction features; they predict the next character or the next word, and can even autocomplete an entire sentence.
In this article, we predict the next word with Python and machine learning.
Now let’s import the required libraries.
import numpy as np
from nltk.tokenize import RegexpTokenizer
from keras.models import Sequential, load_model
from keras.layers import LSTM
from keras.layers.core import Dense, Activation
from keras.optimizers import RMSprop
import matplotlib.pyplot as plt
import pickle
import heapq
Using TensorFlow backend.
Now load the dataset:
text = open('1661-0.txt').read().lower()
print('corpus length:', len(text))
corpus length: 581889
Now we want to split the dataset into individual words, in order, with the special characters removed.
tokenizer = RegexpTokenizer(r'\w+')
words = tokenizer.tokenize(text)
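As a quick sanity check, we can look at the first few tokens; the exact values depend on the copy of the corpus, but they should be lowercase words with the punctuation stripped out:

print(len(words))   # total number of tokens in the corpus
print(words[:8])    # the first few tokens, lowercased, punctuation removed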
Next, for the feature engineering part, we need the sorted list of unique words. We also need a dictionary with each word from the unique_words list as key and its position as value.
unique_words = np.unique(words)
unique_word_index = dict((c, i) for i, c in enumerate(unique_words))
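For example, looking up a word that occurs in the corpus returns its index, and indexing unique_words with that value maps it back to the word:

idx = unique_word_index['the']
print(idx, unique_words[idx])   # position of 'the' in the sorted vocabulary, and the word itself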
Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. Here we slide a window of WORD_LENGTH = 5 consecutive words over the text: each window becomes a feature sequence, and the word that follows it becomes the label.
WORD_LENGTH = 5
prev_words = []
next_words = []
for i in range(len(words) - WORD_LENGTH):
    prev_words.append(words[i:i + WORD_LENGTH])
    next_words.append(words[i + WORD_LENGTH])
print(prev_words[0])
print(next_words[0])
['project', 'gutenberg', 's', 'the', 'adventures']
of
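Since the window slides forward one word at a time, the next training pair is simply the previous context shifted by one position (an optional check):

print(prev_words[1], next_words[1])   # the same window, shifted right by one word
print(len(prev_words))                # number of (context, next word) pairs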
Here, we create two numpy arrays: X (for storing the features) and Y (for storing the corresponding labels).
X = np.zeros((len(prev_words), WORD_LENGTH, len(unique_words)), dtype=bool)
Y = np.zeros((len(next_words), len(unique_words)), dtype=bool)
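A quick, optional check of the shapes: X stores one one-hot vector per context word, and Y one one-hot vector per target word.

print(X.shape)   # (number of sequences, WORD_LENGTH, vocabulary size)
print(Y.shape)   # (number of sequences, vocabulary size)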
We iterate over the sequences and, wherever a word is present, set the corresponding position in X or Y to 1.
for i, each_words in enumerate(prev_words):
    for j, each_word in enumerate(each_words):
        X[i, j, unique_word_index[each_word]] = 1
    Y[i, unique_word_index[next_words[i]]] = 1
Let’s look at a single one-hot vector from the first sequence:
print(X[0][0])
[False False False ... False False False]
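Each of these vectors has exactly one True entry, and its position identifies the word. As an optional check, we can decode the first sequence back into words; it should match prev_words[0] and next_words[0]:

decoded = [unique_words[np.argmax(X[0][j])] for j in range(WORD_LENGTH)]
print(decoded)                        # the five context words of the first sequence
print(unique_words[np.argmax(Y[0])])  # the word that follows them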
We use a single-layer LSTM model with 128 neurons, a fully connected layer, and a softmax function for activation.
model = Sequential()
model.add(LSTM(128, input_shape=(WORD_LENGTH, len(unique_words))))
model.add(Dense(len(unique_words)))
model.add(Activation('softmax'))
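Optionally, model.summary() prints the layer output shapes and parameter counts, which is a convenient way to confirm the architecture before training:

model.summary()   # one LSTM layer followed by a dense softmax over the whole vocabulary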
The model will be trained for 20 epochs with the RMSprop optimizer and categorical cross-entropy loss.
optimizer = RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
history = model.fit(X, Y, validation_split=0.05, batch_size=128, epochs=20, shuffle=True).history
Train on 103759 samples, validate on 5462 samples
Epoch 1/20 - 463s 4ms/step - loss: 6.0041 - acc: 0.1085 - val_loss: 7.1092 - val_acc: 0.1016
Epoch 2/20 - 465s 4ms/step - loss: 5.7047 - acc: 0.1499 - val_loss: 7.8719 - val_acc: 0.1104
Epoch 3/20 - 466s 4ms/step - loss: 5.7338 - acc: 0.1774 - val_loss: 8.1272 - val_acc: 0.1111
Epoch 4/20 - 466s 4ms/step - loss: 5.5189 - acc: 0.2122 - val_loss: 8.2662 - val_acc: 0.0965
Epoch 5/20 - 468s 5ms/step - loss: 5.2698 - acc: 0.2531 - val_loss: 8.4066 - val_acc: 0.0947
Epoch 6/20 - 469s 5ms/step - loss: 4.9910 - acc: 0.2993 - val_loss: 8.4685 - val_acc: 0.0848
Epoch 7/20 - 470s 5ms/step - loss: 4.7476 - acc: 0.3437 - val_loss: 8.6977 - val_acc: 0.0818
Epoch 8/20 - 468s 5ms/step - loss: 4.5374 - acc: 0.3848 - val_loss: 8.8019 - val_acc: 0.0802
Epoch 9/20 - 469s 5ms/step - loss: 4.3463 - acc: 0.4223 - val_loss: 8.9855 - val_acc: 0.0738
Epoch 10/20 - 468s 5ms/step - loss: 4.1888 - acc: 0.4542 - val_loss: 9.2139 - val_acc: 0.0705
Epoch 11/20 - 468s 5ms/step - loss: 4.0692 - acc: 0.4817 - val_loss: 9.2997 - val_acc: 0.0729
Epoch 12/20 - 468s 5ms/step - loss: 3.9562 - acc: 0.5051 - val_loss: 9.4412 - val_acc: 0.0655
Epoch 13/20 - 469s 5ms/step - loss: 3.8481 - acc: 0.5244 - val_loss: 9.5033 - val_acc: 0.0685
Epoch 14/20 - 468s 5ms/step - loss: 3.7430 - acc: 0.5439 - val_loss: 9.5850 - val_acc: 0.0654
Epoch 15/20 - 467s 5ms/step - loss: 3.6554 - acc: 0.5593 - val_loss: 9.6256 - val_acc: 0.0646
Epoch 16/20 - 467s 5ms/step - loss: 3.5782 - acc: 0.5739 - val_loss: 9.7318 - val_acc: 0.0624
Epoch 17/20 - 469s 5ms/step - loss: 3.5137 - acc: 0.5858 - val_loss: 9.7691 - val_acc: 0.0602
Epoch 18/20 - 469s 5ms/step - loss: 3.4569 - acc: 0.5951 - val_loss: 9.7911 - val_acc: 0.0621
Epoch 19/20 - 469s 5ms/step - loss: 3.4045 - acc: 0.6035 - val_loss: 9.7896 - val_acc: 0.0595
Epoch 20/20 - 469s 5ms/step - loss: 3.3596 - acc: 0.6116 - val_loss: 9.8724 - val_acc: 0.0601
After successful training, we will save the trained model and just load it back as needed.
model.save('keras_next_word_model.h5')
pickle.dump(history, open("history.p", "wb"))
model = load_model('keras_next_word_model.h5')
history = pickle.load(open("history.p", "rb"))
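Since matplotlib is already imported, we can also plot the training history (a minimal sketch; the 'acc'/'val_acc' keys match the Keras version used here, newer versions name them 'accuracy'/'val_accuracy'). The plot makes the overfitting visible: training accuracy keeps rising while validation accuracy drops.

plt.plot(history['acc'], label='training accuracy')
plt.plot(history['val_acc'], label='validation accuracy')
plt.xlabel('epoch')
plt.ylabel('accuracy')
plt.legend()
plt.show()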
Now we need to predict new words using this model. To do that, the input string has to be converted into the same kind of feature vector the model was trained on.
def prepare_input(text):
    x = np.zeros((1, WORD_LENGTH, len(unique_words)))
    for t, word in enumerate(text.split()):
        print(word)
        x[0, t, unique_word_index[word]] = 1
    return x

prepare_input("It is not a lack".lower())
it is not a lack
array([[[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]])
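To confirm the encoding, we can find which index is set at each of the five positions and map it back through unique_words (an optional check):

x = prepare_input("It is not a lack".lower())
print([unique_words[np.argmax(x[0, t])] for t in range(WORD_LENGTH)])   # should print the five input words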
Choosing the n most likely words from the model's prediction is done by the sample function.
def sample(preds, top_n=3):
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds)
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)   # re-normalize the predicted probabilities
    # return the indices of the top_n highest probabilities
    return heapq.nlargest(top_n, range(len(preds)), preds.take)
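A tiny, self-contained illustration of sample on a made-up probability vector (the numbers here are arbitrary):

toy_preds = np.array([0.10, 0.50, 0.05, 0.30, 0.05])
print(sample(toy_preds, top_n=2))   # indices of the two largest probabilities: [1, 3]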
Finally, for prediction, we use the function predict_completions, which uses the model to predict and returns the list of the n most likely next words.
def predict_completions(text, n=3):
    if text == "":
        return("0")
    x = prepare_input(text)
    preds = model.predict(x, verbose=0)[0]
    next_indices = sample(preds, n)
    return [unique_words[idx] for idx in next_indices]
Now let’s see how it predicts. We use tokenizer.tokenize to remove the punctuation, and we keep only the first 5 words because our model predicts based on the 5 previous words.
q = "Your life will never be there in the same situation again" print("correct sentence: ",q) seq = " ".join(tokenizer.tokenize(q.lower())[0:5]) print("Sequence: ",seq) print("next possible words: ", predict_completions(seq, 5))