
Predict next word with Python

Keyboards are a part of our everyday life; we use them in every computing environment. To reduce typing effort, most keyboards today offer advanced prediction features: they predict the next character or the next word, and some can even autocomplete an entire sentence.

In this article, we predict the next word with Python and machine learning.

Now let’s import the required libraries.

In [1]:
import numpy as np
from nltk.tokenize import RegexpTokenizer
from keras.models import Sequential, load_model
from keras.layers import LSTM
from keras.layers.core import Dense, Activation
from keras.optimizers import RMSprop
import matplotlib.pyplot as plt
import pickle
import heapq
Using TensorFlow backend.

Now load the dataset. Here we use 1661-0.txt, the plain-text Project Gutenberg edition of The Adventures of Sherlock Holmes, read in and converted to lower case:

In [2]:
text = open('1661-0.txt').read().lower()
print('corpus length:', len(text))
corpus length: 581889

Now we split the entire text into individual words, in order, while dropping punctuation and other special characters.

In [3]:
tokenizer = RegexpTokenizer(r'\w+')
words = tokenizer.tokenize(text)

Next, for the feature engineering part, we need the sorted list of unique words. We also need a dictionary that maps each word from the unique_words list (as key) to its position in that list (as value).

In [5]:
unique_words = np.unique(words)
unique_word_index = dict((c, i) for i, c in enumerate(unique_words))

Feature engineering

Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work.

In [6]:
WORD_LENGTH = 5
prev_words = []
next_words = []
for i in range(len(words) - WORD_LENGTH):
    prev_words.append(words[i:i + WORD_LENGTH])
    next_words.append(words[i + WORD_LENGTH])
print(prev_words[0])
print(next_words[0])
['project', 'gutenberg', 's', 'the', 'adventures']
of
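As a quick sanity check (an addition, not part of the original run), we can confirm that the sliding window pairs every 5-word sequence with exactly one target word:

# Hypothetical sanity check: there should be one target word per 5-word window,
# and the number of windows is the total word count minus WORD_LENGTH.
print(len(words))                           # total tokens in the corpus
print(len(prev_words), len(next_words))     # number of windows and targets
print(prev_words[1], '->', next_words[1])   # the window shifted by one word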

Here, we create two NumPy arrays: X (for storing the features) and Y (for storing the corresponding labels).

In [7]:
X = np.zeros((len(prev_words), WORD_LENGTH, len(unique_words)), dtype=bool)
Y = np.zeros((len(next_words), len(unique_words)), dtype=bool)

We then iterate over the sequences and fill X and Y: wherever a word is present, the corresponding position is set to 1 (one-hot encoding).

In [8]:
for i, each_words in enumerate(prev_words):
    for j, each_word in enumerate(each_words):
        X[i, j, unique_word_index[each_word]] = 1
    Y[i, unique_word_index[next_words[i]]] = 1

Let’s look at a single sequence:

In [9]:
print(X[0][0])
[False False False ... False False False]
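Each row of X[0] is a one-hot vector over the vocabulary. As a quick check (not in the original notebook), we can decode it back to the word it encodes; for the first sequence this should recover 'project':

# Decode a one-hot row back to its word via the index of the single True entry.
print(unique_words[np.argmax(X[0][0])])                  # expected: 'project'
print([unique_words[np.argmax(row)] for row in X[0]])    # the full 5-word window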

Building the model

We use a single-layer LSTM model with 128 neurons, a fully connected layer, and a softmax function for activation.

In [10]:
model = Sequential()
model.add(LSTM(128, input_shape=(WORD_LENGTH, len(unique_words))))
model.add(Dense(len(unique_words)))
model.add(Activation('softmax'))
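Optionally, model.summary() prints the layer shapes and parameter counts, which is a handy check that the input shape (WORD_LENGTH, vocabulary size) and the softmax output size are what we expect:

model.summary()  # one LSTM layer (128 units) followed by a Dense layer + softmax over the vocabulary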

Training

The model will be trained with 20 epochs with an RMSprop optimizer.

In [11]:
optimizer = RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
history = model.fit(X, Y, validation_split=0.05, batch_size=128, epochs=20, shuffle=True).history
Train on 103759 samples, validate on 5462 samples
Epoch 1/20
103759/103759 [==============================] - 463s 4ms/step - loss: 6.0041 - acc: 0.1085 - val_loss: 7.1092 - val_acc: 0.1016
Epoch 2/20
103759/103759 [==============================] - 465s 4ms/step - loss: 5.7047 - acc: 0.1499 - val_loss: 7.8719 - val_acc: 0.1104
Epoch 3/20
103759/103759 [==============================] - 466s 4ms/step - loss: 5.7338 - acc: 0.1774 - val_loss: 8.1272 - val_acc: 0.1111
Epoch 4/20
103759/103759 [==============================] - 466s 4ms/step - loss: 5.5189 - acc: 0.2122 - val_loss: 8.2662 - val_acc: 0.0965
Epoch 5/20
103759/103759 [==============================] - 468s 5ms/step - loss: 5.2698 - acc: 0.2531 - val_loss: 8.4066 - val_acc: 0.0947
Epoch 6/20
103759/103759 [==============================] - 469s 5ms/step - loss: 4.9910 - acc: 0.2993 - val_loss: 8.4685 - val_acc: 0.0848
Epoch 7/20
103759/103759 [==============================] - 470s 5ms/step - loss: 4.7476 - acc: 0.3437 - val_loss: 8.6977 - val_acc: 0.0818
Epoch 8/20
103759/103759 [==============================] - 468s 5ms/step - loss: 4.5374 - acc: 0.3848 - val_loss: 8.8019 - val_acc: 0.0802
Epoch 9/20
103759/103759 [==============================] - 469s 5ms/step - loss: 4.3463 - acc: 0.4223 - val_loss: 8.9855 - val_acc: 0.0738
Epoch 10/20
103759/103759 [==============================] - 468s 5ms/step - loss: 4.1888 - acc: 0.4542 - val_loss: 9.2139 - val_acc: 0.0705
Epoch 11/20
103759/103759 [==============================] - 468s 5ms/step - loss: 4.0692 - acc: 0.4817 - val_loss: 9.2997 - val_acc: 0.0729
Epoch 12/20
103759/103759 [==============================] - 468s 5ms/step - loss: 3.9562 - acc: 0.5051 - val_loss: 9.4412 - val_acc: 0.0655
Epoch 13/20
103759/103759 [==============================] - 469s 5ms/step - loss: 3.8481 - acc: 0.5244 - val_loss: 9.5033 - val_acc: 0.0685
Epoch 14/20
103759/103759 [==============================] - 468s 5ms/step - loss: 3.7430 - acc: 0.5439 - val_loss: 9.5850 - val_acc: 0.0654
Epoch 15/20
103759/103759 [==============================] - 467s 5ms/step - loss: 3.6554 - acc: 0.5593 - val_loss: 9.6256 - val_acc: 0.0646
Epoch 16/20
103759/103759 [==============================] - 467s 5ms/step - loss: 3.5782 - acc: 0.5739 - val_loss: 9.7318 - val_acc: 0.0624
Epoch 17/20
103759/103759 [==============================] - 469s 5ms/step - loss: 3.5137 - acc: 0.5858 - val_loss: 9.7691 - val_acc: 0.0602
Epoch 18/20
103759/103759 [==============================] - 469s 5ms/step - loss: 3.4569 - acc: 0.5951 - val_loss: 9.7911 - val_acc: 0.0621
Epoch 19/20
103759/103759 [==============================] - 469s 5ms/step - loss: 3.4045 - acc: 0.6035 - val_loss: 9.7896 - val_acc: 0.0595
Epoch 20/20
103759/103759 [==============================] - 469s 5ms/step - loss: 3.3596 - acc: 0.6116 - val_loss: 9.8724 - val_acc: 0.0601

The training accuracy climbs to about 61% while the validation loss keeps rising, so the model clearly overfits this small corpus; that is acceptable for a demonstration. After training, we save the model and its history so they can be loaded back whenever needed.

In [12]:
model.save('keras_next_word_model.h5')
pickle.dump(history, open("history.p", "wb"))

model = load_model('keras_next_word_model.h5')
history = pickle.load(open("history.p", "rb"))
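matplotlib was imported above but never used; a small sketch like the one below (not part of the original run) plots the training curves stored in the history dictionary. With this Keras version the keys are 'loss', 'val_loss', 'acc' and 'val_acc':

plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.plot(history['acc'], label='train')
plt.plot(history['val_acc'], label='validation')
plt.title('Accuracy')
plt.xlabel('epoch')
plt.legend()

plt.subplot(1, 2, 2)
plt.plot(history['loss'], label='train')
plt.plot(history['val_loss'], label='validation')
plt.title('Loss')
plt.xlabel('epoch')
plt.legend()

plt.show()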

Prediction

Now we can use this model to predict new words. To do that, we first convert the input string into a single feature vector of the same shape the network was trained on.

In [13]:
def prepare_input(text):
    x = np.zeros((1, WORD_LENGTH, len(unique_words)))
    for t, word in enumerate(text.split()):
        print(word)
        x[0, t, unique_word_index[word]] = 1
    return x
prepare_input("It is not a lack".lower())
it
is
not
a
lack
Out[13]:
array([[[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]])
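Note that prepare_input raises a KeyError if the input contains a word that never appears in the corpus. A defensive variant (a sketch, not from the original notebook) simply skips unknown words:

def prepare_input_safe(text):
    # Same one-hot encoding as prepare_input, but words outside the
    # training vocabulary are skipped instead of raising a KeyError.
    x = np.zeros((1, WORD_LENGTH, len(unique_words)))
    for t, word in enumerate(text.split()[:WORD_LENGTH]):
        if word in unique_word_index:
            x[0, t, unique_word_index[word]] = 1
    return x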

Choosing the n best possible words from the model's prediction is done by the sample function.

In [14]:
def sample(preds, top_n=3):
    # Normalise the predicted probabilities and return the indices of the
    # top_n most likely words. (The log/exp pair leaves the values unchanged;
    # it is the usual temperature-sampling pattern with temperature 1.)
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds)
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    return heapq.nlargest(top_n, range(len(preds)), preds.take)

Finally, for prediction, we use the function predict_completions, which runs the model on the prepared input and returns the list of n predicted words.

In [15]:
def predict_completions(text, n=3):
    if text == "":
        return("0")
    x = prepare_input(text)
    preds = model.predict(x, verbose=0)[0]
    next_indices = sample(preds, n)
    return [unique_words[idx] for idx in next_indices]

Now let's see how it predicts. We use tokenizer.tokenize to remove the punctuation, and we keep only the first 5 words because the model predicts based on the 5 previous words.

In [17]:
q =  "Your life will never be there in the same situation again"
print("correct sentence: ",q)
seq = " ".join(tokenizer.tokenize(q.lower())[0:5])
print("Sequence: ",seq)
print("next possible words: ", predict_completions(seq, 5))
correct sentence:  Your life will never be there in the same situation again
Sequence:  your life will never be
your
life
will
never
be
next possible words:  ['there', 'of', 'no', 'she', 'yourself']
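To try the model on other sentences, the same three steps can be wrapped in a small loop (a sketch along the lines of the cell above; the suggestions printed will depend on the trained weights, and every word of the prompt must occur in the training text, otherwise prepare_input raises a KeyError):

quotes = [
    "That is not because they are harder to solve",
    "I will walk to the station with you",
]
for q in quotes:
    seq = " ".join(tokenizer.tokenize(q.lower())[0:5])   # keep only the first 5 words
    print("Sequence:", seq)
    print("next possible words:", predict_completions(seq, 5))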
