Language Classification with Machine Learning Using Python

Language classification is the grouping of related languages into the same category. Languages are grouped diachronically into language families. In other words, they are grouped according to their development and evolution throughout history, with languages that descend from a common ancestor placed in the same language family.

So let’s start with the task of language classification with Machine Learning using the Python programming language by importing all the modules and packages needed for this task:

In [1]:
import numpy as np # For arithmetics and arrays
import math # For inbuilt math functions
import pandas as pd # For handling data frames
import collections # used for dictionaries and counters
from itertools import permutations # used to find permutations

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split # Import train_test_split function to easily split data into training and testing samples
from sklearn.decomposition import PCA # Principal component analysis used to reduce the number of features in a model
from sklearn.preprocessing import StandardScaler # used to scale data to be used in the model
from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.metrics import accuracy_score
from sklearn.metrics import log_loss

import pickle # To save the trained model and then read it

import seaborn as sns # Create plots
import matplotlib.pyplot as plt

Now, let’s import and clean our data.

In [2]:
df = pd.read_csv('lang_data.csv') # Read raw data
df = df.dropna() # Drop every row that contains a missing value
df['text'] = df['text'].astype(str) # Convert the column "text" from object to a string in order to operate on it
df['language'] = df['language'].astype(str)

Language Classification: Feature Creation

A different set of features may be more effective for other languages. For this task, I will create the following features, which the model will use to classify text as English, Afrikaans, or Dutch:

In [3]:
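# Define a tuple of commonly found punctuation marks
punc = ('!', "," ,"\'" ,";" ,"\"", ".", "-" ,"?")
# Define the vowels used by the vowel-based features below (assumed definition; it is missing from the original cell)
vowels = ['a', 'e', 'i', 'o', 'u']
# Define a list of double consecutive vowels which are typically found in Dutch and Afrikaans languages
same_consecutive_vowels = ['aa','ee', 'ii', 'oo', 'uu']
# All ordered pairs of two different vowels, e.g. 'ae' or 'ou'
consecutive_vowels = [''.join(p) for p in permutations(vowels,2)]
# Letter combination that is characteristic of Dutch
dutch_combos = ['ij']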

# Create a pre-defined set of features based on the "text" column in order to allow us to characterize the string
df['word_count'] = df['text'].apply(lambda x : len(x.split()))
df['character_count'] = df['text'].apply(lambda x : len(x.replace(" ","")))
df['word_density'] = df['word_count'] / (df['character_count'] + 1)
df['punc_count'] = df['text'].apply(lambda x : len([a for a in x if a in punc]))
df['v_char_count'] = df['text'].apply(lambda x : len([a for a in x if a.casefold() == 'v']))
df['w_char_count'] = df['text'].apply(lambda x : len([a for a in x if a.casefold() == 'w']))
df['ij_char_count'] = df['text'].apply(lambda x : sum([any(d_c in a for d_c in dutch_combos) for a in x.split()]))
df['num_double_consec_vowels'] = df['text'].apply(lambda x : sum([any(c_v in a for c_v in same_consecutive_vowels) for a in x.split()]))
df['num_consec_vowels'] = df['text'].apply(lambda x : sum([any(c_v in a for c_v in consecutive_vowels) for a in x.split()]))
df['num_vowels'] = df['text'].apply(lambda x : sum([any(v in a for v in vowels) for a in x.split()]))
df['vowel_density'] = df['num_vowels']/df['word_count']
df['capitals'] = df['text'].apply(lambda comment: sum(1 for c in comment if c.isupper()))
df['caps_vs_length'] = df.apply(lambda row: float(row['capitals'])/float(row['character_count']),axis=1)
df['num_exclamation_marks'] =df['text'].apply(lambda x: x.count('!'))
df['num_question_marks'] = df['text'].apply(lambda x: x.count('?'))
df['num_punctuation'] = df['text'].apply(lambda x: sum(x.count(w) for w in punc))
df['num_unique_words'] = df['text'].apply(lambda x: len(set(w for w in x.split())))
df['num_repeated_words'] = df['text'].apply(lambda x: len([w for w in collections.Counter(x.split()).values() if w > 1]))
df['words_vs_unique'] = df['num_unique_words'] / df['word_count']
# Flag each text as pure ASCII (1) or as containing non-ASCII characters (0)
def is_ascii(text):
    try:
        text.encode('ascii')
    except UnicodeEncodeError:
        return 0
    return 1

df['encode_ascii'] = df['text'].apply(is_ascii)

Language Classification: Summarizing Features

After building the above feature set, we can calculate averages of these features by language to check if there are any obvious significant differences. To do this, simply run the command below:

In [4]:
df.groupby('language').mean(numeric_only=True).T
language                  Afrikaans    English  Nederlands
word_count                10.503912   4.072506    5.746269
character_count           43.541471  16.841849   26.074627
word_density               0.234060   0.226490    0.209378
punc_count                 1.507042   0.317275    1.223881
v_char_count               0.652582   0.126521    0.358209
w_char_count               0.904538   0.291971    0.522388
ij_char_count              0.000000   0.000000    0.268657
num_double_consec_vowels   1.696401   0.178589    1.014925
num_consec_vowels          2.773083   0.536253    1.134328
num_vowels                 9.114241   3.742579    5.671642
vowel_density              0.861087   0.930949    0.990534
capitals                   1.510172   1.193674    1.014925
caps_vs_length             0.040771   0.082882    0.042959
num_exclamation_marks      0.054773   0.003406    0.000000
num_question_marks         0.014085   0.010219    0.000000
num_punctuation            1.507042   0.317275    1.223881
num_unique_words           9.543036   3.992214    5.567164
num_repeated_words         0.787167   0.076399    0.179104
words_vs_unique            0.948318   0.990175    0.978167
encode_ascii               0.597809   0.997567    0.955224

Looking at the first feature, word_count, for example, we can see that Afrikaans texts in this dataset contain more words on average (about 10.5) than English texts (about 4.1) and Dutch texts (about 5.7).
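
To make the idea of per-language averages concrete, here is a minimal sketch on six made-up sentences. The sentences and the `sample` data frame are my own assumptions for illustration and are not taken from lang_data.csv; only two of the features are computed.

```python
import pandas as pd

# Made-up sample sentences (hypothetical, not from lang_data.csv)
sample = pd.DataFrame({
    "text": ["Ek hou van lees", "Die boom is groot",
             "I like to read", "The dog is big",
             "Ik hou van lezen", "De boom is groot"],
    "language": ["Afrikaans", "Afrikaans",
                 "English", "English",
                 "Nederlands", "Nederlands"],
})

double_vowels = ["aa", "ee", "ii", "oo", "uu"]

# Number of whitespace-separated words in each text
sample["word_count"] = sample["text"].apply(lambda x: len(x.split()))
# Number of words that contain a doubled vowel such as 'ee' or 'oo'
sample["num_double_consec_vowels"] = sample["text"].apply(
    lambda x: sum(any(dv in w for dv in double_vowels) for w in x.lower().split())
)

# Average each feature per language, with features as rows
summary = (
    sample.groupby("language")[["word_count", "num_double_consec_vowels"]]
    .mean()
    .T
)
print(summary)
```

Even on this tiny sample, the doubled-vowel feature separates the English sentences from the Afrikaans and Dutch ones, which is the kind of difference we are looking for in the real table.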

Language Classification: Correlation

Next, we need to look at the degree of correlation between the features we have created. The idea behind correlation in the context of our task is that if two or more features are strongly correlated with each other, then they are likely to have very similar explanatory power when classifying languages.

As such, we can keep only one of these features and get nearly the same predictive power from our model. To calculate the correlation matrix:

In [5]:
df.corr(method='pearson', numeric_only=True) # numeric_only skips the "text" and "language" string columns
word_count character_count word_density punc_count v_char_count w_char_count ij_char_count num_double_consec_vowels num_consec_vowels num_vowels vowel_density capitals caps_vs_length num_exclamation_marks num_question_marks num_punctuation num_unique_words num_repeated_words words_vs_unique encode_ascii
word_count 1.000000 0.963818 0.284142 0.656144 0.499937 0.576566 0.002617 0.714460 0.769932 0.985911 -0.158172 0.408048 -0.449626 0.161356 0.054530 0.656144 0.985286 0.785662 -0.609634 -0.500432
character_count 0.963818 1.000000 0.066516 0.685772 0.535066 0.579176 0.012066 0.737948 0.801404 0.960918 -0.089117 0.409393 -0.486624 0.174488 0.065903 0.685772 0.950250 0.751124 -0.570829 -0.506098
word_density 0.284142 0.066516 1.000000 0.001107 0.002239 0.061311 -0.032624 0.050620 0.042554 0.249241 -0.356830 0.081018 0.062398 -0.011030 0.001162 0.001107 0.306803 0.143736 -0.178659 -0.077037
punc_count 0.656144 0.685772 0.001107 1.000000 0.370564 0.374964 0.042030 0.548477 0.563451 0.621920 -0.228459 0.385474 -0.287332 0.205043 0.119390 1.000000 0.652952 0.487491 -0.379424 -0.392489
v_char_count 0.499937 0.535066 0.002239 0.370564 1.000000 0.233034 0.022992 0.389565 0.424615 0.499093 -0.038331 0.250590 -0.249199 0.095870 0.025231 0.370564 0.502898 0.367544 -0.287014 -0.337758
w_char_count 0.576566 0.579176 0.061311 0.374964 0.233034 1.000000 0.005225 0.468219 0.413632 0.580779 -0.002004 0.204286 -0.280512 0.111205 0.101982 0.374964 0.562223 0.464511 -0.353433 -0.298730
ij_char_count 0.002617 0.012066 -0.032624 0.042030 0.022992 0.005225 1.000000 0.000365 -0.010899 0.012548 0.047971 -0.021422 -0.048555 -0.007353 -0.007675 0.042030 0.007278 -0.016064 0.017089 0.023861
num_double_consec_vowels 0.714460 0.737948 0.050620 0.548477 0.389565 0.468219 0.000365 1.000000 0.588212 0.707235 -0.079424 0.245026 -0.352140 0.174573 0.040402 0.548477 0.705972 0.563045 -0.412084 -0.471362
num_consec_vowels 0.769932 0.801404 0.042554 0.563451 0.424615 0.413632 -0.010899 0.588212 1.000000 0.776344 -0.044574 0.286679 -0.401469 0.121268 0.033055 0.563451 0.759246 0.604320 -0.452598 -0.436547
num_vowels 0.985911 0.960918 0.249241 0.621920 0.499093 0.580779 0.012548 0.707235 0.776344 1.000000 -0.022570 0.357889 -0.480243 0.158128 0.057858 0.621920 0.968959 0.782950 -0.614507 -0.450493
vowel_density -0.158172 -0.089117 -0.356830 -0.228459 -0.038331 -0.002004 0.047971 -0.079424 -0.044574 -0.022570 1.000000 -0.196330 -0.119797 -0.011372 0.006199 -0.228459 -0.177521 -0.051552 0.036457 0.244971
capitals 0.408048 0.409393 0.081018 0.385474 0.250590 0.204286 -0.021422 0.245026 0.286679 0.357889 -0.196330 1.000000 0.282340 0.122447 0.082210 0.385474 0.399172 0.329163 -0.195691 -0.199190
caps_vs_length -0.449626 -0.486624 0.062398 -0.287332 -0.249199 -0.280512 -0.048555 -0.352140 -0.401469 -0.480243 -0.119797 0.282340 1.000000 -0.046046 -0.012805 -0.287332 -0.476189 -0.258355 0.282542 0.259936
num_exclamation_marks 0.161356 0.174488 -0.011030 0.205043 0.095870 0.111205 -0.007353 0.174573 0.121268 0.158128 -0.011372 0.122447 -0.046046 1.000000 0.104786 0.205043 0.162700 0.097104 -0.059455 -0.113380
num_question_marks 0.054530 0.065903 0.001162 0.119390 0.025231 0.101982 -0.007675 0.040402 0.033055 0.057858 0.006199 0.082210 -0.012805 0.104786 1.000000 0.119390 0.062724 0.015089 -0.016997 -0.037012
num_punctuation 0.656144 0.685772 0.001107 1.000000 0.370564 0.374964 0.042030 0.548477 0.563451 0.621920 -0.228459 0.385474 -0.287332 0.205043 0.119390 1.000000 0.652952 0.487491 -0.379424 -0.392489
num_unique_words 0.985286 0.950250 0.306803 0.652952 0.502898 0.562223 0.007278 0.705972 0.759246 0.968959 -0.177521 0.399172 -0.476189 0.162700 0.062724 0.652952 1.000000 0.679645 -0.509275 -0.511917
num_repeated_words 0.785662 0.751124 0.143736 0.487491 0.367544 0.464511 -0.016064 0.563045 0.604320 0.782950 -0.051552 0.329163 -0.258355 0.097104 0.015089 0.487491 0.679645 1.000000 -0.853010 -0.342473
words_vs_unique -0.609634 -0.570829 -0.178659 -0.379424 -0.287014 -0.353433 0.017089 -0.412084 -0.452598 -0.614507 0.036457 -0.195691 0.282542 -0.059455 -0.016997 -0.379424 -0.509275 -0.853010 1.000000 0.272145
encode_ascii -0.500432 -0.506098 -0.077037 -0.392489 -0.337758 -0.298730 0.023861 -0.471362 -0.436547 -0.450493 0.244971 -0.199190 0.259936 -0.113380 -0.037012 -0.392489 -0.511917 -0.342473 0.272145 1.000000

We can also visualize the pairwise correlation matrix using the following command:

In [6]:
[Figure: pairwise correlation of the features]

We can see that several of the variables are strongly positively correlated. For example, word_count and character_count have a correlation of about 0.96, which means they tell us roughly the same thing about the length of a text.
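
As a rough sketch of how such redundant features could be listed automatically, the snippet below scans the upper triangle of a correlation matrix and collects every column that is highly correlated with an earlier one. The toy data, the `threshold` of 0.9, and the names `toy` and `to_drop` are assumptions made for illustration; the article itself does not drop features this way.

```python
import numpy as np
import pandas as pd

# Hypothetical toy features standing in for the engineered columns of df
rng = np.random.default_rng(0)
word_count = rng.integers(1, 20, size=200)
toy = pd.DataFrame({
    "word_count": word_count,
    # Almost proportional to word_count, so strongly correlated with it
    "character_count": word_count * 4 + rng.integers(0, 3, size=200),
    # Drawn independently of the other two columns
    "punc_count": rng.integers(0, 5, size=200),
})

corr = toy.corr(method="pearson").abs()

# Keep only the entries above the diagonal so each pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # assumed cut-off for "strongly correlated"
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
print(to_drop)
```

Here only character_count is flagged, because it duplicates the information already carried by word_count.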

Language Classification: Splitting The Data

Before going any further in building our language classification model, we need to split the dataset into training and test sets:

In [7]:
#split dataset into features and target variable
feature_cols = list(df.columns)[2:]
X = df[feature_cols] # Features
y = df['language'] # Target variable (a 1-D Series, which is the shape scikit-learn expects for a target)
# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2) # 80% train and 20% test
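
The per-language averages suggest that the three languages may not be equally represented, so it can be worth keeping the class proportions the same in both sets and making the split repeatable. The sketch below shows one way to do that on a made-up data frame; the `stratify` and `random_state` arguments are my additions and are not used in the cell above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 10 English, 5 Afrikaans, 5 Dutch rows
toy = pd.DataFrame({
    "word_count": range(20),
    "language": ["English"] * 10 + ["Afrikaans"] * 5 + ["Nederlands"] * 5,
})
X_toy = toy[["word_count"]]
y_toy = toy["language"]

# stratify keeps the language proportions equal in both sets;
# random_state makes the split reproducible between runs
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

test_counts = y_te.value_counts()
print(test_counts)
```

With 20 rows and a 20% test size, the test set holds four rows: two English, one Afrikaans, and one Dutch.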

Reducing Correlation

We should aim to use only features that carry distinct information in our classification models, as correlated variables add little to the predictive power of the models.

One method used in machine learning to reduce the correlation between features is principal component analysis (PCA):

In [8]:
# Standardize the data
scaler = StandardScaler()
# Fit on training set only.
scaler.fit(X_train)
# Transform both the training set and the test set.
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Make an instance of the model to retain 95% of the variance within the old features.
pca = PCA(.95)
# Fit PCA on the training set only.
pca.fit(X_train)

print('Number of Principal Components = '+str(pca.n_components_))
# Number of Principal Components = 13

X_train = pca.transform(X_train)
X_test = pca.transform(X_test)

After running the code above, you will notice that the PCA reduced the number of features from 20 to 13 by turning the original features into a new set of components that keep 95% of the variance of the information in the original set.
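
To see where a number like 13 comes from, it helps to look at the cumulative explained variance. The following is a small sketch on synthetic data, not on the article's features: four columns are built from only two underlying signals, so two components should be enough to pass the 95% mark. The data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(300, 2))

# Synthetic data: columns 0/1 and 2/3 are near-duplicates of each other
X_demo = np.column_stack([
    base[:, 0],
    base[:, 0] + 0.05 * rng.normal(size=300),
    base[:, 1],
    base[:, 1] + 0.05 * rng.normal(size=300),
])

X_scaled = StandardScaler().fit_transform(X_demo)

# A float between 0 and 1 asks PCA for the smallest number of components
# whose explained variance adds up to at least that fraction
pca_demo = PCA(n_components=0.95).fit(X_scaled)

cumulative = np.cumsum(pca_demo.explained_variance_ratio_)
print(pca_demo.n_components_)
print(cumulative)
```

PCA keeps adding components until the running total in `cumulative` reaches the requested fraction, which is how the original 20 correlated features can shrink to a smaller set of components.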

Using Decision Tree Algorithm

A decision tree model learns by dividing the training set into subsets based on an attribute value test. This process is repeated recursively until every sample in the subset at a node has the same value of the target variable, or until additional splitting no longer improves the predictive capacity of the model.

I will fit the decision tree classifier to the training set and save the trained model to a pickle file, which can be loaded for future use. We then use the model to classify the texts in the test set by language.
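
The "attribute value test" can be illustrated with a few lines of plain Python. The sketch below scores one candidate split with Gini impurity, which is the default criterion of scikit-learn's DecisionTreeClassifier. The six data points and the helper names `gini` and `split_impurity` are made up for this example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(values, labels, threshold):
    """Weighted Gini impurity of the two subsets produced by value <= threshold."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Made-up example: ij_char_count for six texts and their languages
ij_counts = [0, 0, 0, 0, 1, 2]
languages = ["English", "English", "Afrikaans", "Afrikaans",
             "Nederlands", "Nederlands"]

before = gini(languages)
after = split_impurity(ij_counts, languages, threshold=0)
print(round(before, 3), round(after, 3))
```

Splitting on ij_char_count <= 0 isolates the Dutch texts and cuts the impurity from about 0.667 to about 0.333. A tree would pick this split if no other test lowers the impurity further, and then repeat the search inside each subset.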

In [9]:
dt_clf = DecisionTreeClassifier() # Create Decision Tree classifier object
dt_clf = dt_clf.fit(X_train,y_train) # Fit/Train Decision Tree Classifier on training set

# Save model to file in the current working directory so that it can be imported and used.
# I use the pickle library to save the parameters of the trained model
pkl_file = "decision_tree_model.pkl"
with open(pkl_file, 'wb') as file:
    pickle.dump(dt_clf, file)

# Load previously trained model from pickle file
with open(pkl_file, 'rb') as file:
    dt_clf = pickle.load(file)

dt_clf # parameters of the Decision Tree model are shown below and can be further optimized to improve model performance

y_pred = dt_clf.predict(X_test) #Predict the response for test dataset

Now let’s have a look at the accuracy of our language classification model:

In [11]:
accuracy_score_dt = accuracy_score(y_test, y_pred)

The decision tree algorithm gave an accuracy of almost 90%. Now let’s have a look at the confusion matrix to see how the texts of each language were classified:

In [13]:
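labels = ['English', 'Afrikaans', 'Nederlands']
# Confusion Matrix (recent scikit-learn versions require labels to be passed by keyword)
cm_Model_dt = confusion_matrix(y_test, y_pred, labels=labels)
fig = plt.figure(figsize=(9,9))
ax = fig.add_subplot(111)
sns.heatmap(cm_Model_dt, annot=True, fmt=".3f", linewidths=.5, square = True, cmap = 'Blues_r',
            xticklabels=labels, yticklabels=labels, ax=ax)
ax.set_ylabel('Actual language') # rows of the confusion matrix
ax.set_xlabel('Predicted language') # columns of the confusion matrix
title = 'Decision Tree Model Accuracy Score = '+ str(round(accuracy_score_dt*100,2)) +"%"
plt.title(title, size = 15)

[Figure: Language Classification Model Accuracy (confusion matrix heatmap)]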

The graph above shows how many texts were assigned to each language, with the y-axis representing the actual language and the x-axis representing the predicted language. The counts on the diagonal are the correctly classified texts. This tells us that the model does well at predicting English texts as well as Afrikaans texts.
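
To turn such a matrix into one number per language, we can divide the diagonal by the row totals, which gives the share of each language's texts that were classified correctly (the recall). This is a minimal sketch on made-up labels, not the article's test set; `y_true_demo` and `y_pred_demo` are hypothetical.

```python
from sklearn.metrics import confusion_matrix

labels = ["English", "Afrikaans", "Nederlands"]

# Made-up actual and predicted languages for seven texts
y_true_demo = ["English", "English", "English",
               "Afrikaans", "Afrikaans",
               "Nederlands", "Nederlands"]
y_pred_demo = ["English", "English", "English",
               "Afrikaans", "Nederlands",
               "Afrikaans", "Nederlands"]

# Rows are actual languages, columns are predicted languages
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=labels)

# Correct predictions per language divided by the number of texts in that language
recall = cm.diagonal() / cm.sum(axis=1)
for name, r in zip(labels, recall):
    print(f"{name}: {r:.2f}")
```

Reading the result per language makes it easier to spot a class the model struggles with, which a single overall accuracy score can hide.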

