Topic Modeling in Machine Learning using Python

Topic modeling is a type of statistical modeling for discovering the abstract “topics” that appear in a collection of documents. In the most common approach, Latent Dirichlet Allocation (LDA), this means building a topics-per-document model and a words-per-topic model, both modeled as Dirichlet distributions. In this article, I will walk you through the task of Topic Modeling in Machine Learning with Python.
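
To make the Dirichlet-based description above a little more concrete, here is a minimal sketch of how such a model is assumed to generate a document: each topic gets a distribution over words, each document gets a distribution over topics, and both are drawn from Dirichlet priors. The vocabulary, the number of topics, and the prior values `alpha` and `beta` are illustrative assumptions and do not come from the datasets used in this article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary and settings, chosen only for illustration
vocab = ["gene", "cell", "model", "data", "graph", "proof"]
n_topics = 2
alpha = np.full(n_topics, 0.5)    # assumed prior over topics per document
beta = np.full(len(vocab), 0.5)   # assumed prior over words per topic

# Words-per-topic model: one distribution over the vocabulary for each topic
topic_word = rng.dirichlet(beta, size=n_topics)

def generate_document(n_words):
    # Topics-per-document model: one distribution over topics for this document
    doc_topic = rng.dirichlet(alpha)
    # Pick a topic for every word position, then a word from that topic
    topics = rng.choice(n_topics, size=n_words, p=doc_topic)
    words = [vocab[rng.choice(len(vocab), p=topic_word[t])] for t in topics]
    return doc_topic, words

doc_topic, words = generate_document(8)
print(doc_topic)
print(words)
```

Fitting a topic model runs this process in reverse: given only the words, it estimates the topic mixtures and word distributions that most plausibly produced them.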

Now let’s get started by importing all the libraries that we need for this task:

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

Now, the next step is to read all the datasets that I am using in this task:

In [2]:
train = pd.read_csv("Train.csv")
test = pd.read_csv("Test.csv")
tags = pd.read_csv("Tags.csv")
sample_sub = pd.read_csv("SampleSubmission.csv")

# Uncomment to count the missing values in each column
# print(train.isna().sum())
# print(test.isna().sum())
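
As a quick illustration of what a missing-value check reports, here is a minimal sketch on a small stand-in DataFrame. Only the `ABSTRACT` column name comes from this article; the `demo` frame and its rows are made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the training data; the rows are invented
demo = pd.DataFrame({
    "ABSTRACT": ["We study graph models.", np.nan, "A proof of convergence."],
})

# isna() marks missing cells; sum() counts them per column
missing_counts = demo.isna().sum()
print(missing_counts)
```

Note that `sum` needs the trailing parentheses: without them, `print` shows the bound method rather than the counts.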

Exploratory Data Analysis

Exploratory Data Analysis (EDA) explores the data to find relationships between measures: it tells us that these relationships exist, but not what causes them. The relationships it uncovers can be used to formulate hypotheses, but they do not prove causation, as the saying “correlation does not imply causation” indicates.

In [3]:
# Length of each abstract in characters (str() guards against missing values)
train["Number of Characters"] = train["ABSTRACT"].apply(lambda x: len(str(x)))
test["Number of Characters"] = test["ABSTRACT"].apply(lambda x: len(str(x)))

# One row, two columns: histogram on the left, box plot on the right
fig = make_subplots(rows=1, cols=2)
trace1 = go.Histogram(x = train["Number of Characters"])
fig.add_trace(trace1, row=1, col=1)

trace2 = go.Box(y = train["Number of Characters"])
fig.add_trace(trace2, row=1, col=2)
fig.update_layout(showlegend=False)
fig.show()
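
The box plot above draws unusually long or short abstracts as separate points. As a rough numeric counterpart, here is a minimal sketch that flags such values with the common 1.5 × IQR rule. The sample abstracts are made up, and the quartiles pandas computes may differ slightly from the ones the plot uses, so treat this as an approximation rather than an exact reproduction of the figure.

```python
import pandas as pd

# Hypothetical abstracts standing in for train["ABSTRACT"]
abstracts = pd.Series([
    "short text",
    "a somewhat longer abstract here",
    "mid length text",
    "x" * 500,
    "another abstract of text",
])

# Same character count as in the article
lengths = abstracts.apply(lambda x: len(str(x)))

# Quartiles and interquartile range
q1, q3 = lengths.quantile([0.25, 0.75])
iqr = q3 - q1

# Values outside the 1.5 * IQR fences are treated as outliers
outliers = lengths[(lengths < q1 - 1.5 * iqr) | (lengths > q3 + 1.5 * iqr)]
print(lengths.describe())
print(outliers)
```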