Build Autocorrect in Python
Have you ever thought about how the autocorrect feature works in a smartphone keyboard? Almost every smartphone brand, irrespective of price, provides an autocorrect feature in its keyboard today. So let’s understand how the autocorrect feature works. In this article, I will take you through how to build autocorrect with Python.
In the context of Machine Learning, autocorrect is based on Natural Language Processing. As the name suggests, it is programmed to correct spelling mistakes and other errors while typing.
Here I use the text of a book, saved as Data.txt, as the dataset for building autocorrect in Python.
First, we import the necessary libraries:
import pandas as pd
import numpy as np
import textdistance
import re
from collections import Counter
Now we read the Data.txt file, lowercase its contents, and split it into words to build the vocabulary:
with open('Data.txt', 'r') as f:
    file_name_data = f.read()

file_name_data = file_name_data.lower()
# Use a raw string so that \w reaches the regex engine unchanged
words = re.findall(r'\w+', file_name_data)

# This is our vocabulary
V = set(words)

print(f"The first ten words in the text are: \n{words[0:10]}")
print(f"There are {len(V)} unique words in the vocabulary.")
The first ten words in the text are:
['the', 'project', 'gutenberg', 'ebook', 'of', 'moby', 'dick', 'or', 'the', 'whale']
There are 17647 unique words in the vocabulary.
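To see what the `\w+` pattern actually extracts, here is a small sketch on a made-up sentence. The sentence and the `tokenize` helper name are illustrative assumptions, not part of the article's dataset or code. Since `\w` matches letters, digits, and underscores, punctuation is dropped and contractions are split at the apostrophe.

```python
import re

def tokenize(text):
    """Lowercase the text and return its runs of word characters."""
    return re.findall(r'\w+', text.lower())

# Illustrative sentence, not taken from Data.txt
sample = "Call me Ishmael. Don't panic!"
tokens = tokenize(sample)
print(tokens)
```

In this sketch "Don't" becomes the two tokens 'don' and 't', so a vocabulary built this way may contain fragments such as single letters.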
In the above code, we made a list of words. Now we need to count how often each word occurs, which is easily done with the Counter class from Python's collections module:
word_freq_dict = Counter(words)
print(word_freq_dict.most_common()[0:10])
[('the', 14703), ('of', 6742), ('and', 6517), ('a', 4799), ('to', 4707), ('in', 4238), ('that', 3081), ('it', 2534), ('his', 2530), ('i', 2120)]
Next, we compute the probability of occurrence of each word. This equals its relative frequency: the word's count divided by the total number of words in the text:
probs = {}
Total = sum(word_freq_dict.values())
for k in word_freq_dict.keys():
    probs[k] = word_freq_dict[k] / Total
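As a quick check of the relative-frequency idea, the following sketch applies the same computation to a tiny hand-made word list. The list and the `toy_` variable names are assumptions made for illustration only.

```python
from collections import Counter

# Tiny illustrative corpus, not taken from Data.txt
toy_words = ['the', 'whale', 'the', 'sea', 'the', 'whale']

toy_counts = Counter(toy_words)
toy_total = sum(toy_counts.values())

# Relative frequency: count of the word divided by the total number of words
toy_probs = {word: count / toy_total for word, count in toy_counts.items()}
print(toy_probs)
```

Here 'the' occurs 3 times out of 6 words, so its probability is 0.5, and the probabilities of all words add up to 1.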
Now we rank the vocabulary words by their similarity to the input word. The similarity is 1 minus the Jaccard distance, computed on the character 2-grams (q = 2) of the two words. The function returns the 5 most similar words, ordered by similarity and then by probability:
def my_autocorrect(input_word):
    input_word = input_word.lower()
    # A word that is already in the vocabulary is treated as correctly spelled
    if input_word in V:
        return 'Your word seems to be correct'
    # Jaccard similarity on character 2-grams against every vocabulary word
    similarities = [1 - textdistance.Jaccard(qval=2).distance(v, input_word) for v in word_freq_dict.keys()]
    df = pd.DataFrame.from_dict(probs, orient='index').reset_index()
    df = df.rename(columns={'index': 'Word', 0: 'Prob'})
    df['Similarity'] = similarities
    # Rank by similarity, break ties by probability; head() keeps the top 5 rows
    output = df.sort_values(['Similarity', 'Prob'], ascending=False).head()
    return output
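The two-key ordering can be illustrated without pandas. The sketch below is a plain-Python stand-in for the sort step, not the article's implementation, and the candidate words and numbers in it are made up for illustration. It shows that probability only matters when two candidates have the same similarity.

```python
# (word, similarity, probability) triples; all values are made up for illustration
candidates = [
    ('whale', 0.50, 0.0040),
    ('while', 0.50, 0.0090),
    ('whole', 0.40, 0.0100),
]

# Sort by similarity first, then by probability, both descending
ranked = sorted(candidates, key=lambda c: (c[1], c[2]), reverse=True)
top_words = [word for word, _, _ in ranked]
print(top_words)
```

In this toy data 'while' and 'whale' tie on similarity, so the more probable 'while' comes first, while 'whole' stays last despite having the highest probability.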
Let's first try the word 'nevertheless'. It is already in the vocabulary, so the function reports it as correct:
my_autocorrect('nevertheless')
Now let's try the misspelled word 'nevrtless'. The function returns the 5 most similar words, ordered by similarity and then by probability:
my_autocorrect('nevrtless')
| | Word | Prob | Similarity |
|---|---|---|---|
| 2571 | nevertheless | 0.000225 | 0.461538 |
| 10481 | heartless | 0.000018 | 0.454545 |
| 13600 | nestle | 0.000004 | 0.444444 |
| 16146 | heartlessness | 0.000004 | 0.428571 |
| 12513 | subtleness | 0.000004 | 0.416667 |
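
The similarity values in the table can be reproduced by hand. The sketch below is a pure-Python approximation, assuming that `textdistance.Jaccard(qval=2)` compares multisets of character 2-grams (shared 2-grams divided by the 2-grams in the union, counting repeats). That assumption is consistent with the values shown above. The `bigrams` and `jaccard_similarity` helper names are my own.

```python
from collections import Counter

def bigrams(word):
    """Return the multiset of character 2-grams in the word."""
    return Counter(word[i:i + 2] for i in range(len(word) - 1))

def jaccard_similarity(a, b):
    """Multiset Jaccard similarity over character 2-grams."""
    grams_a, grams_b = bigrams(a), bigrams(b)
    intersection = sum((grams_a & grams_b).values())  # minimum count per 2-gram
    union = sum((grams_a | grams_b).values())         # maximum count per 2-gram
    return intersection / union if union else 1.0

for candidate in ['nevertheless', 'heartless', 'nestle', 'heartlessness']:
    print(candidate, round(jaccard_similarity('nevrtless', candidate), 6))
```

For example, 'nevrtless' has 8 2-grams and 'nevertheless' has 11, of which 6 are shared (ne, ev, rt, le, es, ss), giving 6 / (8 + 11 - 6) = 6/13 ≈ 0.461538.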