Named Entity Recognition with Python
In machine learning, named entity recognition is an essential subtask of natural language processing. It aims to locate and classify words and multi-word phrases with special meaning, e.g. people, organizations, places, dates, etc.
Named entity recognition comes from information extraction (IE). IE's job is to transform unstructured data into structured information. In named entity recognition, the unstructured data is text written in natural language, and we want to extract the important information into a well-defined format, e.g. a relational database.
The named entity recognition task attempts to correctly detect and classify text expressions into a set of predefined classes. The classes can vary, but classes such as people (PER), organizations (ORG), and places (LOC) are used very often.
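To make these labels concrete, here is a small illustrative sketch of token-level annotation in the IOB scheme that the dataset used below follows (the example sentence and its tags are invented for illustration):

```python
# Illustration of token-level IOB tags: 'B-' marks the beginning of an
# entity, 'I-' a continuation, and 'O' a token outside any entity.
sentence = ['Thousands', 'of', 'demonstrators', 'marched', 'through', 'London']
tags     = ['O',         'O',  'O',             'O',       'O',       'B-geo']
for word, tag in zip(sentence, tags):
    print(f'{word:15} {tag}')
```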
I will start this task by importing the necessary Python libraries and the dataset:
```python
import pandas as pd

# Load the annotated NER dataset
data = pd.read_csv('ner_dataset.csv', encoding='unicode_escape')
data.head()
```
| | Sentence # | Word | POS | Tag |
|---|---|---|---|---|
| 0 | Sentence: 1 | Thousands | NNS | O |
| 1 | NaN | of | IN | O |
| 2 | NaN | demonstrators | NNS | O |
| 3 | NaN | have | VBP | O |
| 4 | NaN | marched | VBN | O |
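Before transforming anything, it is worth checking the size of the dataset and its tag inventory. This is a quick, hedged inspection that is not part of the original pipeline; the exact counts depend on your copy of ner_dataset.csv:

```python
# Basic statistics: rows, vocabulary size, sentence count and tag set.
print('rows:', len(data))
print('unique words:', data['Word'].nunique())
print('sentences:', data['Sentence #'].nunique())
print('tags:', sorted(data['Tag'].unique()))
```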
I will train a neural network for the Named Entity Recognition (NER) task. So we need to make some modifications to the data so that it can easily be fed into a neural network. I'll start this step by extracting the mappings needed to train the neural network:
```python
def get_dict_map(data, token_or_tag):
    # Build the vocabulary from either the words or the tags
    if token_or_tag == 'token':
        vocab = list(set(data['Word'].to_list()))
    else:
        vocab = list(set(data['Tag'].to_list()))
    # Map each item to a unique integer index, and back
    idx2tok = {idx: tok for idx, tok in enumerate(vocab)}
    tok2idx = {tok: idx for idx, tok in enumerate(vocab)}
    return tok2idx, idx2tok

token2idx, idx2token = get_dict_map(data, 'token')
tag2idx, idx2tag = get_dict_map(data, 'tag')

# Add integer-encoded word and tag columns to the dataframe
data['Word_idx'] = data['Word'].map(token2idx)
data['Tag_idx'] = data['Tag'].map(tag2idx)
```
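As a quick sanity check, you can peek at a few entries of the mappings. This is a minimal sketch, not part of the original pipeline; the exact integer indices vary between runs because the vocabulary is built from an unordered set:

```python
# Show a handful of tag-to-index entries and the new encoded columns.
print(dict(list(tag2idx.items())[:5]))
print(data[['Word', 'Word_idx', 'Tag', 'Tag_idx']].head())
```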
Now, I'm going to transform the columns in the data to extract the sequential data for our neural network:
```python
# Fill the missing 'Sentence #' values forward so every row carries its sentence id
data_fillna = data.fillna(method='ffill', axis=0)

# Group by sentence and collect each column into a list per sentence
data_group = data_fillna.groupby(['Sentence #'], as_index=False)[
    ['Word', 'POS', 'Tag', 'Word_idx', 'Tag_idx']
].agg(lambda x: list(x))
```
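Each row of data_group now holds one complete sentence as parallel lists. Here is a minimal sketch to confirm the structure, assuming the data_group produced above:

```python
# Inspect the first grouped sentence: parallel lists of words,
# POS tags, NER tags and their integer encodings.
first = data_group.iloc[0]
print(first['Word'][:5])     # first five words of sentence 1
print(first['Tag_idx'][:5])  # their integer-encoded NER tags
```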
I will now divide the data into training and test sets. I am going to create a function to split the data, because LSTM layers only accept sequences of the same length. Thus, each sentence, which appears as a sequence of integers in the data, must be padded to the same length. The toy sketch below illustrates the padding step; the full split-and-pad function follows it.
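Here is a minimal toy sketch of what pad_sequences does, using two invented integer sequences (not from the dataset):

```python
from keras.preprocessing.sequence import pad_sequences

# Two sequences of different lengths are padded at the end ('post')
# with the value 0 until both reach length 3.
toy = [[4, 7, 1], [9, 2]]
print(pad_sequences(toy, maxlen=3, dtype='int32', padding='post', value=0))
# [[4 7 1]
#  [9 2 0]]
```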
```python
from sklearn.model_selection import train_test_split
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

def get_pad_train_test_val(data_group, data):
    # Vocabulary size and length of the longest sentence
    n_token = len(list(set(data['Word'].to_list())))
    tokens = data_group['Word_idx'].tolist()
    maxlen = max([len(s) for s in tokens])

    # Pad the tokens (the X variable) at the end of each sentence
    pad_tokens = pad_sequences(tokens, maxlen=maxlen, dtype='int32',
                               padding='post', value=n_token - 1)

    # Pad the tags (the y variable) with the 'O' tag and one-hot encode them
    tags = data_group['Tag_idx'].tolist()
    pad_tags = pad_sequences(tags, maxlen=maxlen, dtype='int32',
                             padding='post', value=tag2idx["O"])
    n_tags = len(tag2idx)
    pad_tags = [to_categorical(i, num_classes=n_tags) for i in pad_tags]

    # Split into train (67.5%), validation (22.5%) and test (10%) sets
    tokens_, test_tokens, tags_, test_tags = train_test_split(
        pad_tokens, pad_tags, test_size=0.1, train_size=0.9, random_state=2020)
    train_tokens, val_tokens, train_tags, val_tags = train_test_split(
        tokens_, tags_, test_size=0.25, train_size=0.75, random_state=2020)

    print(
        'train_tokens length:', len(train_tokens),
        '\ntrain_tags length:', len(train_tags),
        '\ntest_tokens length:', len(test_tokens),
        '\ntest_tags length:', len(test_tags),
        '\nval_tokens length:', len(val_tokens),
        '\nval_tags length:', len(val_tags),
    )
    return train_tokens, val_tokens, test_tokens, train_tags, val_tags, test_tags

train_tokens, val_tokens, test_tokens, train_tags, val_tags, test_tags = \
    get_pad_train_test_val(data_group, data)
```
```
Using TensorFlow backend.
train_tokens length: 32372 
train_tags length: 32372 
test_tokens length: 4796 
test_tags length: 4796 
val_tokens length: 10791 
val_tags length: 10791
```
I will now build the neural network architecture of our model. Let's start by importing all the packages we need to train the network. Then I'll define the model's dimensions: the vocabulary size for the embedding layer, the embedding output dimension, the maximum sequence length, and the number of tags:
```python
import numpy as np
import tensorflow
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Embedding, Dense, TimeDistributed, Bidirectional
from tensorflow.keras.utils import plot_model
from numpy.random import seed

# Fix the random seeds for reproducibility
seed(1)
tensorflow.random.set_seed(2)

# Model dimensions
input_dim = len(list(set(data['Word'].to_list()))) + 1                  # vocabulary size (+1)
output_dim = 64                                                          # embedding dimension
input_length = max([len(s) for s in data_group['Word_idx'].tolist()])    # longest sentence
n_tags = len(tag2idx)                                                    # number of entity tags
```
Now I will create a helper function that builds the model and prints a summary of each layer of the neural network for the task of recognizing named entities with Python:
```python
def get_bilstm_lstm_model():
    model = Sequential()

    # Embedding layer: maps word indices to dense vectors
    model.add(Embedding(input_dim=input_dim, output_dim=output_dim,
                        input_length=input_length))

    # Bidirectional LSTM: reads each sentence in both directions
    model.add(Bidirectional(LSTM(units=output_dim, return_sequences=True,
                                 dropout=0.2, recurrent_dropout=0.2),
                            merge_mode='concat'))

    # Second (unidirectional) LSTM layer
    model.add(LSTM(units=output_dim, return_sequences=True,
                   dropout=0.5, recurrent_dropout=0.5))

    # TimeDistributed Dense: a softmax over the tag classes for every token,
    # so the outputs are valid probabilities for categorical cross-entropy
    model.add(TimeDistributed(Dense(n_tags, activation='softmax')))

    # Compile the model
    model.compile(loss='categorical_crossentropy', optimizer='adam',
                  metrics=['accuracy'])
    model.summary()
    return model
```
Now I will create a function to train our model:
```python
def train_model(X, y, model):
    loss = list()
    for i in range(25):
        # Fit the model for one epoch at a time so we can record the loss
        hist = model.fit(X, y, batch_size=1000, verbose=1, epochs=1,
                         validation_split=0.2)
        loss.append(hist.history['loss'][0])
    return loss

results = pd.DataFrame()
model_bilstm_lstm = get_bilstm_lstm_model()
plot_model(model_bilstm_lstm)
results['with_add_lstm'] = train_model(train_tokens, np.array(train_tags), model_bilstm_lstm)
```
Model: "sequential" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= embedding (Embedding) (None, 104, 64) 2251456 _________________________________________________________________ bidirectional (Bidirectional (None, 104, 128) 66048 _________________________________________________________________ lstm_1 (LSTM) (None, 104, 64) 49408 _________________________________________________________________ time_distributed (TimeDistri (None, 104, 17) 1105 ================================================================= Total params: 2,368,017 Trainable params: 2,368,017 Non-trainable params: 0 _________________________________________________________________ 26/26 [==============================] - 79s 3s/step - loss: 0.9465 - accuracy: 0.9178 - val_accuracy: 0.9681 - val_loss: 0.4600 26/26 [==============================] - 78s 3s/step - loss: 0.3821 - accuracy: 0.9677 - val_accuracy: 0.9681 - val_loss: 0.3507 26/26 [==============================] - 78s 3s/step - loss: 0.3583 - accuracy: 0.9677 - val_accuracy: 0.9681 - val_loss: 0.3141 26/26 [==============================] - 78s 3s/step - loss: 0.3061 - accuracy: 0.9677 - val_accuracy: 0.9681 - val_loss: 0.2781 26/26 [==============================] - 78s 3s/step - loss: 0.2905 - accuracy: 0.9677 - val_accuracy: 0.9681 - val_loss: 0.2648 26/26 [==============================] - 78s 3s/step - loss: 0.2787 - accuracy: 0.9677 - val_accuracy: 0.9681 - val_loss: 0.2583 26/26 [==============================] - 78s 3s/step - loss: 0.2705 - accuracy: 0.9677 - val_accuracy: 0.9681 - val_loss: 0.2547 26/26 [==============================] - 79s 3s/step - loss: 0.3674 - accuracy: 0.9676 - val_accuracy: 0.9670 - val_loss: 0.6538 26/26 [==============================] - 78s 3s/step - loss: 0.4729 - accuracy: 0.9664 - val_accuracy: 0.9682 - val_loss: 0.2367 26/26 [==============================] - 79s 3s/step - loss: 0.2188 - accuracy: 0.9678 - val_accuracy: 0.9681 - val_loss: 0.1946 26/26 [==============================] - 78s 3s/step - loss: 0.2012 - accuracy: 0.9678 - val_accuracy: 0.9681 - val_loss: 0.1882 26/26 [==============================] - 78s 3s/step - loss: 0.1905 - accuracy: 0.9678 - val_accuracy: 0.9681 - val_loss: 0.1954 26/26 [==============================] - 78s 3s/step - loss: 0.1877 - accuracy: 0.9678 - val_accuracy: 0.9682 - val_loss: 0.3138 26/26 [==============================] - 79s 3s/step - loss: 0.2475 - accuracy: 0.9677 - val_accuracy: 0.9681 - val_loss: 0.1898 26/26 [==============================] - 78s 3s/step - loss: 0.1790 - accuracy: 0.9677 - val_accuracy: 0.9681 - val_loss: 0.1622 26/26 [==============================] - 78s 3s/step - loss: 0.1530 - accuracy: 0.9678 - val_accuracy: 0.9681 - val_loss: 0.1506 26/26 [==============================] - 79s 3s/step - loss: 0.1464 - accuracy: 0.9678 - val_accuracy: 0.9682 - val_loss: 0.1740 26/26 [==============================] - 80s 3s/step - loss: 0.1813 - accuracy: 0.9678 - val_accuracy: 0.9682 - val_loss: 0.1599 26/26 [==============================] - 79s 3s/step - loss: 0.1380 - accuracy: 0.9678 - val_accuracy: 0.9682 - val_loss: 0.1367 26/26 [==============================] - 78s 3s/step - loss: 0.1274 - accuracy: 0.9679 - val_accuracy: 0.9682 - val_loss: 0.1330 26/26 [==============================] - 78s 3s/step - loss: 0.1217 - accuracy: 0.9679 - val_accuracy: 0.9682 - val_loss: 0.1282 26/26 [==============================] - 80s 3s/step - loss: 0.1175 - 
accuracy: 0.9679 - val_accuracy: 0.9682 - val_loss: 0.1243 26/26 [==============================] - 78s 3s/step - loss: 0.1129 - accuracy: 0.9679 - val_accuracy: 0.9682 - val_loss: 0.1210 26/26 [==============================] - 81s 3s/step - loss: 0.1108 - accuracy: 0.9679 - val_accuracy: 0.9682 - val_loss: 0.1199 26/26 [==============================] - 80s 3s/step - loss: 0.1094 - accuracy: 0.9679 - val_accuracy: 0.9682 - val_loss: 0.1201
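With training finished, a natural next step is to run the model on a held-out sentence and map the predicted indices back to readable tags. This is a minimal sketch using the model and the idx2token / idx2tag mappings built above; it is not part of the original pipeline:

```python
# Predict tag probabilities for the first test sentence, then take the
# most likely tag per token and decode both words and tags.
pred = model_bilstm_lstm.predict(test_tokens[:1])  # shape: (1, maxlen, n_tags)
pred_idx = np.argmax(pred, axis=-1)[0]             # best tag index per token

# Print the first 20 tokens with their predicted tags
for word_idx, tag_idx in zip(test_tokens[0][:20], pred_idx[:20]):
    print(f'{idx2token[int(word_idx)]:20} {idx2tag[int(tag_idx)]}')
```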
Now, I will use the spaCy library in Python to visualize named entity recognition on some text. Note that this step uses spaCy's pretrained English model rather than the network we just trained. I will pass in a few lines about myself, and let's see what we get after running the code:
```python
import spacy
from spacy import displacy

# Load spaCy's small pretrained English pipeline
nlp = spacy.load('en_core_web_sm')

text = nlp('Hi, My name is Ramesh Kumar \n I am from India \n I want to work with Google \n Steve Jobs is My Inspiration')
displacy.render(text, style='ent', jupyter=True)
```
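displacy.render with jupyter=True draws the highlighted entities inline in a notebook. If you are running a plain Python script instead, one hedged alternative is to render the markup to an HTML string and save it yourself (the file name here is arbitrary):

```python
# Outside a notebook: render the entity visualization to HTML and save it.
html = displacy.render(text, style='ent', page=True, jupyter=False)
with open('entities.html', 'w', encoding='utf-8') as f:
    f.write(html)
```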