Language Classification with Machine Learning
Language classification is the grouping of related languages into the same category. Languages are grouped diachronically into language families; in other words, languages are grouped according to their development and evolution throughout history, with languages that descend from a common ancestor placed in the same family.
So let’s start with the task of language classification with Machine Learning using the Python programming language by importing all the modules and packages needed for this task:
```python
import numpy as np                   # For arithmetic and arrays
import math                          # For built-in math functions
import pandas as pd                  # For handling data frames
import collections                   # Used for dictionaries and counters
from itertools import permutations   # Used to find permutations

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier        # Decision Tree classifier
from sklearn.model_selection import train_test_split   # To easily split data into training and testing samples
from sklearn.decomposition import PCA                  # Principal component analysis, used to reduce the number of features in a model
from sklearn.preprocessing import StandardScaler       # Used to scale data before modelling
from sklearn import metrics                            # scikit-learn metrics module for accuracy calculation
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.metrics import accuracy_score
from sklearn.metrics import log_loss

import pickle                        # To save the trained model and read it back

import seaborn as sns                # To create plots
sns.set(style="ticks")
import matplotlib.pyplot as plt
```
Now, let’s import and clean our data.
```python
df = pd.read_csv('lang_data.csv')            # Read raw data
df = df.dropna()                             # Remove rows with null values
df['text'] = df['text'].astype(str)          # Convert the "text" column from object to string so we can operate on it
df['language'] = df['language'].astype(str)
```
I will now create a new set of features to be used in the language classification model, which classifies text into three languages: English, Afrikaans, and Dutch. Keep in mind that different features may be more effective for classifying other languages:
```python
# Define a list of commonly found punctuation marks
punc = ('!', ',', "'", ';', '"', '.', '-', '?')
vowels = ['a', 'e', 'i', 'o', 'u']
# Define a list of double consecutive vowels, which are typically found
# in the Dutch and Afrikaans languages
same_consecutive_vowels = ['aa', 'ee', 'ii', 'oo', 'uu']
consecutive_vowels = [''.join(p) for p in permutations(vowels, 2)]
dutch_combos = ['ij']

# Create a pre-defined set of features based on the "text" column
# in order to characterize each string
df['word_count'] = df['text'].apply(lambda x: len(x.split()))
df['character_count'] = df['text'].apply(lambda x: len(x.replace(" ", "")))
df['word_density'] = df['word_count'] / (df['character_count'] + 1)
df['punc_count'] = df['text'].apply(lambda x: len([a for a in x if a in punc]))
df['v_char_count'] = df['text'].apply(lambda x: len([a for a in x if a.casefold() == 'v']))
df['w_char_count'] = df['text'].apply(lambda x: len([a for a in x if a.casefold() == 'w']))
# Number of words containing the Dutch 'ij' combination
df['ij_char_count'] = df['text'].apply(
    lambda x: sum([any(d_c in a for d_c in dutch_combos) for a in x.split()]))
# Number of words containing double consecutive vowels
df['num_double_consec_vowels'] = df['text'].apply(
    lambda x: sum([any(c_v in a for c_v in same_consecutive_vowels) for a in x.split()]))
# Number of words containing any two consecutive vowels
df['num_consec_vowels'] = df['text'].apply(
    lambda x: sum([any(c_v in a for c_v in consecutive_vowels) for a in x.split()]))
# Number of words containing at least one vowel
df['num_vowels'] = df['text'].apply(
    lambda x: sum([any(v in a for v in vowels) for a in x.split()]))
df['vowel_density'] = df['num_vowels'] / df['word_count']
df['capitals'] = df['text'].apply(lambda comment: sum(1 for c in comment if c.isupper()))
df['caps_vs_length'] = df.apply(
    lambda row: float(row['capitals']) / float(row['character_count']), axis=1)
df['num_exclamation_marks'] = df['text'].apply(lambda x: x.count('!'))
df['num_question_marks'] = df['text'].apply(lambda x: x.count('?'))
df['num_punctuation'] = df['text'].apply(lambda x: sum(x.count(w) for w in punc))
df['num_unique_words'] = df['text'].apply(lambda x: len(set(w for w in x.split())))
df['num_repeated_words'] = df['text'].apply(
    lambda x: len([w for w in collections.Counter(x.split()).values() if w > 1]))
df['words_vs_unique'] = df['num_unique_words'] / df['word_count']

# Flag whether a text is pure ASCII; non-ASCII characters (e.g. diacritics)
# are a useful signal for Afrikaans and Dutch. Assigning with .loc avoids
# the SettingWithCopyWarning raised by chained .iloc assignment.
df['encode_ascii'] = np.nan
for i in range(len(df)):
    try:
        df['text'].iloc[i].encode(encoding='utf-8').decode('ascii')
    except UnicodeDecodeError:
        df.loc[df.index[i], 'encode_ascii'] = 0
    else:
        df.loc[df.index[i], 'encode_ascii'] = 1
```
After building the above feature set, we can calculate averages of these features by language to check if there are any obvious significant differences. To do this, simply run the command below:
```python
df.groupby('language').mean().T
```
feature (mean by language) | Afrikaans | English | Nederlands |
---|---|---|---|
word_count | 10.503912 | 4.072506 | 5.746269 |
character_count | 43.541471 | 16.841849 | 26.074627 |
word_density | 0.234060 | 0.226490 | 0.209378 |
punc_count | 1.507042 | 0.317275 | 1.223881 |
v_char_count | 0.652582 | 0.126521 | 0.358209 |
w_char_count | 0.904538 | 0.291971 | 0.522388 |
ij_char_count | 0.000000 | 0.000000 | 0.268657 |
num_double_consec_vowels | 1.696401 | 0.178589 | 1.014925 |
num_consec_vowels | 2.773083 | 0.536253 | 1.134328 |
num_vowels | 9.114241 | 3.742579 | 5.671642 |
vowel_density | 0.861087 | 0.930949 | 0.990534 |
capitals | 1.510172 | 1.193674 | 1.014925 |
caps_vs_length | 0.040771 | 0.082882 | 0.042959 |
num_exclamation_marks | 0.054773 | 0.003406 | 0.000000 |
num_question_marks | 0.014085 | 0.010219 | 0.000000 |
num_punctuation | 1.507042 | 0.317275 | 1.223881 |
num_unique_words | 9.543036 | 3.992214 | 5.567164 |
num_repeated_words | 0.787167 | 0.076399 | 0.179104 |
words_vs_unique | 0.948318 | 0.990175 | 0.978167 |
encode_ascii | 0.597809 | 0.997567 | 0.955224 |
Looking at the first feature, word_count, for example, we can see that Afrikaans sentences tend to contain more words than English or Dutch ones (roughly 10.5 words on average versus 4.1 and 5.7, respectively).
Next, we need to look at the degree of correlation between the features we have created. The idea behind correlation in the context of our task is that if two or more features are strongly correlated with each other, they will likely have very similar explanatory power when classifying languages.
As such, we can keep just one of them and get almost the same predictive power from our model. To calculate the correlation matrix:
```python
df.corr(method='pearson')
```
 | word_count | character_count | word_density | punc_count | v_char_count | w_char_count | ij_char_count | num_double_consec_vowels | num_consec_vowels | num_vowels | vowel_density | capitals | caps_vs_length | num_exclamation_marks | num_question_marks | num_punctuation | num_unique_words | num_repeated_words | words_vs_unique | encode_ascii |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
word_count | 1.000000 | 0.963818 | 0.284142 | 0.656144 | 0.499937 | 0.576566 | 0.002617 | 0.714460 | 0.769932 | 0.985911 | -0.158172 | 0.408048 | -0.449626 | 0.161356 | 0.054530 | 0.656144 | 0.985286 | 0.785662 | -0.609634 | -0.500432 |
character_count | 0.963818 | 1.000000 | 0.066516 | 0.685772 | 0.535066 | 0.579176 | 0.012066 | 0.737948 | 0.801404 | 0.960918 | -0.089117 | 0.409393 | -0.486624 | 0.174488 | 0.065903 | 0.685772 | 0.950250 | 0.751124 | -0.570829 | -0.506098 |
word_density | 0.284142 | 0.066516 | 1.000000 | 0.001107 | 0.002239 | 0.061311 | -0.032624 | 0.050620 | 0.042554 | 0.249241 | -0.356830 | 0.081018 | 0.062398 | -0.011030 | 0.001162 | 0.001107 | 0.306803 | 0.143736 | -0.178659 | -0.077037 |
punc_count | 0.656144 | 0.685772 | 0.001107 | 1.000000 | 0.370564 | 0.374964 | 0.042030 | 0.548477 | 0.563451 | 0.621920 | -0.228459 | 0.385474 | -0.287332 | 0.205043 | 0.119390 | 1.000000 | 0.652952 | 0.487491 | -0.379424 | -0.392489 |
v_char_count | 0.499937 | 0.535066 | 0.002239 | 0.370564 | 1.000000 | 0.233034 | 0.022992 | 0.389565 | 0.424615 | 0.499093 | -0.038331 | 0.250590 | -0.249199 | 0.095870 | 0.025231 | 0.370564 | 0.502898 | 0.367544 | -0.287014 | -0.337758 |
w_char_count | 0.576566 | 0.579176 | 0.061311 | 0.374964 | 0.233034 | 1.000000 | 0.005225 | 0.468219 | 0.413632 | 0.580779 | -0.002004 | 0.204286 | -0.280512 | 0.111205 | 0.101982 | 0.374964 | 0.562223 | 0.464511 | -0.353433 | -0.298730 |
ij_char_count | 0.002617 | 0.012066 | -0.032624 | 0.042030 | 0.022992 | 0.005225 | 1.000000 | 0.000365 | -0.010899 | 0.012548 | 0.047971 | -0.021422 | -0.048555 | -0.007353 | -0.007675 | 0.042030 | 0.007278 | -0.016064 | 0.017089 | 0.023861 |
num_double_consec_vowels | 0.714460 | 0.737948 | 0.050620 | 0.548477 | 0.389565 | 0.468219 | 0.000365 | 1.000000 | 0.588212 | 0.707235 | -0.079424 | 0.245026 | -0.352140 | 0.174573 | 0.040402 | 0.548477 | 0.705972 | 0.563045 | -0.412084 | -0.471362 |
num_consec_vowels | 0.769932 | 0.801404 | 0.042554 | 0.563451 | 0.424615 | 0.413632 | -0.010899 | 0.588212 | 1.000000 | 0.776344 | -0.044574 | 0.286679 | -0.401469 | 0.121268 | 0.033055 | 0.563451 | 0.759246 | 0.604320 | -0.452598 | -0.436547 |
num_vowels | 0.985911 | 0.960918 | 0.249241 | 0.621920 | 0.499093 | 0.580779 | 0.012548 | 0.707235 | 0.776344 | 1.000000 | -0.022570 | 0.357889 | -0.480243 | 0.158128 | 0.057858 | 0.621920 | 0.968959 | 0.782950 | -0.614507 | -0.450493 |
vowel_density | -0.158172 | -0.089117 | -0.356830 | -0.228459 | -0.038331 | -0.002004 | 0.047971 | -0.079424 | -0.044574 | -0.022570 | 1.000000 | -0.196330 | -0.119797 | -0.011372 | 0.006199 | -0.228459 | -0.177521 | -0.051552 | 0.036457 | 0.244971 |
capitals | 0.408048 | 0.409393 | 0.081018 | 0.385474 | 0.250590 | 0.204286 | -0.021422 | 0.245026 | 0.286679 | 0.357889 | -0.196330 | 1.000000 | 0.282340 | 0.122447 | 0.082210 | 0.385474 | 0.399172 | 0.329163 | -0.195691 | -0.199190 |
caps_vs_length | -0.449626 | -0.486624 | 0.062398 | -0.287332 | -0.249199 | -0.280512 | -0.048555 | -0.352140 | -0.401469 | -0.480243 | -0.119797 | 0.282340 | 1.000000 | -0.046046 | -0.012805 | -0.287332 | -0.476189 | -0.258355 | 0.282542 | 0.259936 |
num_exclamation_marks | 0.161356 | 0.174488 | -0.011030 | 0.205043 | 0.095870 | 0.111205 | -0.007353 | 0.174573 | 0.121268 | 0.158128 | -0.011372 | 0.122447 | -0.046046 | 1.000000 | 0.104786 | 0.205043 | 0.162700 | 0.097104 | -0.059455 | -0.113380 |
num_question_marks | 0.054530 | 0.065903 | 0.001162 | 0.119390 | 0.025231 | 0.101982 | -0.007675 | 0.040402 | 0.033055 | 0.057858 | 0.006199 | 0.082210 | -0.012805 | 0.104786 | 1.000000 | 0.119390 | 0.062724 | 0.015089 | -0.016997 | -0.037012 |
num_punctuation | 0.656144 | 0.685772 | 0.001107 | 1.000000 | 0.370564 | 0.374964 | 0.042030 | 0.548477 | 0.563451 | 0.621920 | -0.228459 | 0.385474 | -0.287332 | 0.205043 | 0.119390 | 1.000000 | 0.652952 | 0.487491 | -0.379424 | -0.392489 |
num_unique_words | 0.985286 | 0.950250 | 0.306803 | 0.652952 | 0.502898 | 0.562223 | 0.007278 | 0.705972 | 0.759246 | 0.968959 | -0.177521 | 0.399172 | -0.476189 | 0.162700 | 0.062724 | 0.652952 | 1.000000 | 0.679645 | -0.509275 | -0.511917 |
num_repeated_words | 0.785662 | 0.751124 | 0.143736 | 0.487491 | 0.367544 | 0.464511 | -0.016064 | 0.563045 | 0.604320 | 0.782950 | -0.051552 | 0.329163 | -0.258355 | 0.097104 | 0.015089 | 0.487491 | 0.679645 | 1.000000 | -0.853010 | -0.342473 |
words_vs_unique | -0.609634 | -0.570829 | -0.178659 | -0.379424 | -0.287014 | -0.353433 | 0.017089 | -0.412084 | -0.452598 | -0.614507 | 0.036457 | -0.195691 | 0.282542 | -0.059455 | -0.016997 | -0.379424 | -0.509275 | -0.853010 | 1.000000 | 0.272145 |
encode_ascii | -0.500432 | -0.506098 | -0.077037 | -0.392489 | -0.337758 | -0.298730 | 0.023861 | -0.471362 | -0.436547 | -0.450493 | 0.244971 | -0.199190 | 0.259936 | -0.113380 | -0.037012 | -0.392489 | -0.511917 | -0.342473 | 0.272145 | 1.000000 |
We can also visualize the pairwise relationships between the features using the following command:
```python
sns.pairplot(df)
```
[Figure: seaborn pairplot showing pairwise scatter plots of the engineered features]
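Since `sns.pairplot` draws pairwise scatter plots rather than the correlation matrix itself, a more compact alternative (not part of the original output) is to render the Pearson matrix as a heatmap. A minimal sketch, using the seaborn and matplotlib imports from above:

```python
# Sketch: render the Pearson correlation matrix as a heatmap
corr = df.corr(method='pearson')
plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap='coolwarm', center=0, square=True)
plt.title('Pairwise Pearson correlation of the engineered features')
plt.show()
```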
We can notice how several of the variables are strongly positively correlated. For example, word_count and character_count have a correlation of around 96%, which means they tell us roughly the same thing in terms of the length of a text for each language considered.
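To make "strongly correlated" concrete, we can list every feature pair whose absolute Pearson correlation exceeds a threshold; the 0.9 cutoff below is an arbitrary choice for illustration, not a value from the analysis above:

```python
# Sketch: list feature pairs with |Pearson correlation| above a cutoff
corr = df.corr(method='pearson')
threshold = 0.9  # arbitrary cutoff for "strongly correlated"
pairs = [(a, b, corr.loc[a, b])
         for i, a in enumerate(corr.columns)
         for b in corr.columns[i + 1:]
         if abs(corr.loc[a, b]) > threshold]
for a, b, r in sorted(pairs, key=lambda t: -abs(t[2])):
    print(f"{a} ~ {b}: {r:.3f}")
# e.g. word_count ~ character_count: 0.964
```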
Before going any further in building our language classification model, we need to divide the dataset into training and test sets:
```python
# Split dataset into features and target variable
feature_cols = list(df.columns)[2:]
X = df[feature_cols]    # Features
y = df[['language']]    # Target variable

# Split dataset into training set and test set: 80% train and 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
```
We should aim to use only the most distinctive features in our classification models, as correlated variables add little to their predictive power.
One method used in machine learning to reduce the correlation between features is called principal component analysis or PCA:
```python
# Standardize the data
scaler = StandardScaler()
# Fit on the training set only.
scaler.fit(X_train)
# Transform both the training set and the test set.
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Make an instance of the model that retains 95% of the variance
# within the old features.
pca = PCA(.95)
pca.fit(X_train)
print('Number of Principal Components = ' + str(pca.n_components_))

X_train = pca.transform(X_train)
X_test = pca.transform(X_test)
```
Number of Principal Components = 13
After running the code above, you will notice that PCA reduced the number of features from 20 to 13 by transforming the original features into a new set of components that retain 95% of the variance of the original set.
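If you want to verify the 95% figure and see how the variance is spread across the 13 components, the fitted `pca` object exposes this directly. A quick check, assuming the `pca` fitted above:

```python
# Sketch: variance explained by each retained principal component
ratios = pca.explained_variance_ratio_
for i, r in enumerate(ratios, start=1):
    print(f"PC{i}: {r:.2%}")
print(f"Total variance retained: {ratios.sum():.2%}")  # should be >= 95%
```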
With the reduced feature set in hand, we can now train a classifier. A decision tree model learns by dividing the training set into subsets based on an attribute value test, and this process is repeated over recursive partitions until the subset at a node all has the same value of the target variable, or until additional splitting no longer improves the predictive capacity of the model.
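To make the "attribute value test" concrete: by default, scikit-learn's `DecisionTreeClassifier` scores candidate splits with Gini impurity and picks the test that makes the child subsets purest. A toy illustration with made-up labels (not data from our dataset):

```python
# Sketch: Gini impurity, the default split criterion in DecisionTreeClassifier
from collections import Counter

def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

left, right = ['en', 'en', 'en'], ['af', 'af', 'nl']   # hypothetical split
parent = left + right
weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(parent)
print(gini(parent))   # ~0.611: mixed node, high impurity
print(weighted)       # ~0.222: the split lowers impurity, so it is a good test
```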
I will fit the decision tree classifier to the training set and save the model parameters to a pickle file, which can be imported for future use. We then use the model to predict the language of the texts in the test set.
```python
dt_clf = DecisionTreeClassifier()      # Create Decision Tree classifier object
dt_clf = dt_clf.fit(X_train, y_train)  # Fit/train the Decision Tree classifier on the training set

# Save the model to a file in the current working directory so that it can be
# imported and reused. I use the pickle library to save the parameters of the
# trained model.
pkl_file = "decision_tree_model.pkl"
with open(pkl_file, 'wb') as file:
    pickle.dump(dt_clf, file)

# Load the previously trained model from the pickle file
with open(pkl_file, 'rb') as file:
    dt_clf = pickle.load(file)

# The parameters of the Decision Tree model are shown below and can be
# further optimized to improve model performance
dt_clf

y_pred = dt_clf.predict(X_test)  # Predict the response for the test dataset
```
Now let’s have a look at the accuracy of our language classification model:
```python
accuracy_score_dt = accuracy_score(y_test, y_pred)
print(accuracy_score_dt)
```
0.8462929475587704
The decision tree algorithm gave an accuracy of almost 85%. Now let’s have a look at the confusion matrix to visualize how the texts were classified across the languages:
```python
labels = ['English', 'Afrikaans', 'Nederlands']

# Confusion matrix (passing labels as a keyword argument avoids the
# FutureWarning raised by newer versions of scikit-learn)
cm_Model_dt = confusion_matrix(y_test, y_pred, labels=labels)

fig = plt.figure(figsize=(9, 9))
ax = fig.add_subplot(111)
sns.heatmap(cm_Model_dt, annot=True, fmt=".3f", linewidths=.5,
            square=True, cmap='Blues_r')
plt.ylabel('Actual')
plt.xlabel('Predicted')
ax.set_xticklabels(labels)
ax.set_yticklabels(labels)
title = 'Decision Tree Model Accuracy Score = ' + str(round(accuracy_score_dt * 100, 2)) + "%"
plt.title(title, size=15)
```
[Figure: confusion matrix heatmap titled 'Decision Tree Model Accuracy Score = 84.63%']
The graph above shows how many texts were categorized correctly in each of the languages, with the y-axis representing the actual language and the x-axis the predicted one. It tells us that the model does well at predicting both English and Afrikaans texts.
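Beyond overall accuracy, the `classification_report` we imported at the start breaks the results down into per-language precision, recall, and F1-score. A quick follow-up, using the `y_test` and `y_pred` from above:

```python
# Per-language precision, recall, and F1-score
print(classification_report(y_test, y_pred))
```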