Topic Modeling in Machine Learning using Python

Topic modeling is a type of statistical modeling for discovering the abstract “topics” that occur in a collection of documents. In Latent Dirichlet Allocation (LDA), a common approach, this means fitting a topics-per-document model and a words-per-topic model, both modeled as Dirichlet distributions. In this article, I will walk you through the task of Topic Modeling in Machine Learning with Python.
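
To make the two Dirichlet-distributed pieces concrete, here is a minimal sketch of the generative story behind LDA, written with the standard library only. The vocabulary, the number of topics, and the concentration values are made up for illustration and are not taken from the dataset used in this article; the helper names are hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, doc_alpha, vocab, length, rng):
    """Generate a toy document: pick a topic for each word, then a word from that topic."""
    topic_mix = sample_dirichlet(doc_alpha, rng)  # topics-per-document proportions
    words = []
    for _ in range(length):
        topic = rng.choices(range(len(topic_mix)), weights=topic_mix)[0]
        word = rng.choices(vocab, weights=topic_word[topic])[0]
        words.append(word)
    return topic_mix, words


rng = random.Random(0)  # fixed seed so the sketch is repeatable
vocab = ["matrix", "proof", "quantum", "particle", "network", "dataset"]  # illustrative only
n_topics = 3

# Words-per-topic: one distribution over the vocabulary for each topic.
topic_word = [sample_dirichlet([0.5] * len(vocab), rng) for _ in range(n_topics)]

topic_mix, words = generate_document(topic_word, [0.3] * n_topics, vocab, 8, rng)
print("topic proportions:", topic_mix)
print("generated words:", words)
```

Fitting a topic model runs this story in reverse: given only the words, it estimates the topic proportions of each document and the word distribution of each topic.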

Let’s get started by importing the libraries needed for this task:

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

The next step is to read the datasets used in this task:

In [2]:
train = pd.read_csv("Train.csv")
test = pd.read_csv("Test.csv")
tags = pd.read_csv("Tags.csv")
sample_sub = pd.read_csv("SampleSubmission.csv")

# Uncomment to count the missing values in each column
# print(train.isna().sum())
# print(test.isna().sum())
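
Before exploring the data, it can help to count the missing values in each column. The sketch below does this on a small made-up frame standing in for Train.csv; only the `ABSTRACT` and `Physics` column names come from this article, and the rows are invented.

```python
import pandas as pd

# Hypothetical stand-in for Train.csv (the rows are invented for illustration).
toy = pd.DataFrame({
    "ABSTRACT": ["We study graph kernels.", float("nan"), "A proof of the lemma."],
    "Physics": [0, 1, 0],
})

# isna() marks missing cells; sum() must be called (with parentheses) to count them per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Without the parentheses, `toy.isna().sum` is only a reference to the method, so printing it shows the method object rather than the counts.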

Exploratory Data Analysis

Exploratory Data Analysis (EDA) explores the data to find relationships between measures without establishing their cause. These relationships do not prove causation, but they can be used to formulate hypotheses.

In [3]:
train["Number of Characters"] = train["ABSTRACT"].apply(lambda x: len(str(x)))
test["Number of Characters"] = test["ABSTRACT"].apply(lambda x: len(str(x)))
fig = make_subplots(rows=1, cols=2)
trace1 = go.Histogram(x = train["Number of Characters"])
fig.add_trace(trace1, row=1, col=1)

trace2 = go.Box(y = train["Number of Characters"])
fig.add_trace(trace2, row=1, col=2)
fig.update_layout(showlegend=False)
fig.show()

There is great variability in the number of characters in the abstracts of the training set, from a minimum of 54 to a maximum of 4551 characters. The median is 1065 characters.
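
The minimum, median, and maximum quoted above can be computed directly from the length column. The sketch below uses a made-up frame, and it also shows one side effect of `len(str(x))`: a missing abstract is converted to the string "nan" and counted as 3 characters. The `fillna("")` variant is an alternative I am assuming here, not the code used in this article.

```python
import pandas as pd

# Made-up abstracts; the last one is missing, as it would be after read_csv on an empty cell.
toy = pd.DataFrame({
    "ABSTRACT": ["Deep nets.", "A longer abstract about topology.", float("nan")],
})

# Same expression as in the article: the missing value becomes the string "nan".
as_in_article = toy["ABSTRACT"].apply(lambda x: len(str(x)))

# Alternative (an assumption, not the article's code): treat missing abstracts as empty text.
missing_as_empty = toy["ABSTRACT"].fillna("").str.len()

# Summary statistics of the kind quoted in the text.
summary = as_in_article.agg(["min", "median", "max"])

print(as_in_article.tolist())
print(missing_as_empty.tolist())
print(summary)
```

If the dataset has no missing abstracts, both variants give the same lengths.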

In [4]:
fig = make_subplots(rows=1, cols=2)
trace1 = go.Histogram(x = test["Number of Characters"])
fig.add_trace(trace1, row=1, col=1)

trace2 = go.Box(y = test["Number of Characters"])
fig.add_trace(trace2, row=1, col=2)
fig.update_layout(showlegend=False)
fig.show()

The test set has a narrower range than the training set: the minimum number of characters is 46 and the maximum is 2841. The median is 1058 characters, which is very close to that of the training set.

In [5]:
train['Number of Words'] = train['ABSTRACT'].apply(lambda x: len(str(x).split()))
test['Number of Words'] = test['ABSTRACT'].apply(lambda x: len(str(x).split()))
fig = make_subplots(rows = 1, cols = 2)
trace1 = go.Histogram(x = train['Number of Words'])
fig.add_trace(trace1, row = 1, col = 1)

trace2 = go.Box(y = train['Number of Words'])
fig.add_trace(trace2, row = 1, col = 2)

fig.update_layout(showlegend = False)
fig.show()

The training set shows a similar trend in the number of words as in the number of characters: a minimum of 8 words and a maximum of 665 words. The median word count is 153.

In [6]:
fig = make_subplots(rows = 1, cols = 2)
trace1 = go.Histogram(x = test['Number of Words'])
fig.add_trace(trace1, row = 1, col = 1)

trace2 = go.Box(y = test['Number of Words'])
fig.add_trace(trace2, row = 1, col = 2)

fig.update_layout(showlegend = False)
fig.show()

In the test set, abstracts range from a minimum of 7 words to a maximum of 452 words. The median is 153, exactly the same as in the training set.
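
To compare the two splits without reading the numbers off separate plots, the word-count summaries can be placed side by side in one table. This is a minimal sketch on made-up frames; `word_count_summary` is a hypothetical helper and not part of the original notebook.

```python
import pandas as pd


def word_count_summary(frame, column="ABSTRACT"):
    """Return the min, median and max whitespace-separated word counts of one split."""
    counts = frame[column].apply(lambda x: len(str(x).split()))
    return counts.agg(["min", "median", "max"])


# Made-up stand-ins for the train and test frames.
train_toy = pd.DataFrame({"ABSTRACT": ["one two three", "four five", "six seven eight nine"]})
test_toy = pd.DataFrame({"ABSTRACT": ["alpha beta", "gamma delta epsilon", "zeta"]})

# One column per split, one row per statistic.
comparison = pd.DataFrame({
    "train": word_count_summary(train_toy),
    "test": word_count_summary(test_toy),
})
print(comparison)
```

On the real data, the same helper could be called with `train` and `test` in place of the toy frames.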

Topic Modeling Using Tags

There are many methods of topic modeling. In this task I will use the tags, so let’s start by exploring them:

In [7]:
main_tags = ['Computer Science',
 'Mathematics',
 'Physics',
 'Statistics']

countTagsTrain = pd.DataFrame(train[main_tags].sum(axis = 0) / len(train))
countTagsTest = pd.DataFrame(test[main_tags].sum(axis = 0) / len(test))

trace0 = go.Bar(x = countTagsTrain.index, y = countTagsTrain[0],name = 'Train Set')
trace1 = go.Bar(x = countTagsTest.index, y = countTagsTest[0],name = 'Test Set')

fig = go.Figure([trace0,trace1])
fig.show()
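
The bar heights above are the share of abstracts carrying each tag. Assuming the tag columns are 0/1 indicators (which the sum-divided-by-length computation implies), the same share is the column mean, and one abstract can carry several tags, so the shares need not add up to 1. A minimal sketch on made-up rows:

```python
import pandas as pd

main_tags = ["Computer Science", "Mathematics", "Physics", "Statistics"]

# Made-up 0/1 indicator rows; the first and third rows carry two tags each.
toy = pd.DataFrame(
    [[1, 0, 0, 1],
     [0, 1, 0, 0],
     [1, 0, 0, 1],
     [0, 0, 1, 0]],
    columns=main_tags,
)

share_by_sum = toy[main_tags].sum(axis=0) / len(toy)  # computation used in the article
share_by_mean = toy[main_tags].mean()                 # equivalent for 0/1 columns
tags_per_row = toy[main_tags].sum(axis=1)             # number of tags on each abstract

print(share_by_sum)
print("total of shares:", share_by_sum.sum())
print("max tags on one abstract:", tags_per_row.max())
```

Comparing these shares between the training and test sets, as the bar chart does, is a quick check that both splits have a similar tag distribution.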

This is how we can approach the task of Topic Modeling using the Python programming language. I hope you liked this article on Topic Modeling in Machine Learning with Python. Feel free to ask your questions in the comments section below.
