Topic Modeling in Machine Learning
Topic modeling is a type of statistical modeling for discovering the abstract “topics” that occur in a collection of documents. It involves building two models, a topics-per-document model and a words-per-topic model, each modeled as a Dirichlet distribution. In this article, I will walk you through the task of Topic Modeling in Machine Learning with Python.
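To make the two Dirichlet-distributed pieces described above more concrete, here is a minimal sketch of the generative idea behind this family of models (commonly associated with Latent Dirichlet Allocation). The vocabulary, the number of topics, the alpha values and the function name generate_document are all made up for illustration and are not part of this article's dataset:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical toy setup: 3 documents, 2 topics, 4-word vocabulary.
n_docs, n_topics = 3, 2
vocab = ["gene", "cell", "market", "price"]

# Topics per document: each row is one document's mixture over topics.
doc_topic = rng.dirichlet(alpha=[1.0] * n_topics, size=n_docs)

# Words per topic: each row is one topic's distribution over the vocabulary.
topic_word = rng.dirichlet(alpha=[1.0] * len(vocab), size=n_topics)


def generate_document(doc_index, length=8):
    """Pick a topic for each word position, then a word from that topic."""
    words = []
    for _ in range(length):
        topic = rng.choice(n_topics, p=doc_topic[doc_index])
        word = rng.choice(vocab, p=topic_word[topic])
        words.append(str(word))
    return words


sample_doc = generate_document(0)
print(doc_topic.round(2))  # one row of topic proportions per document
print(sample_doc)          # eight words drawn for the first document
```

Fitting a topic model runs this process in reverse: given only the words, it estimates the two tables that most plausibly produced them.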
Now let’s get started with Topic Modeling in Python by importing the libraries that we need for this task:
import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
Now, the next step is to read all the datasets that I am using in this task:
train = pd.read_csv("Train.csv")
test = pd.read_csv("Test.csv")
tags = pd.read_csv("Tags.csv")
sample_sub = pd.read_csv("SampleSubmission.csv")
# Uncomment to count the missing values in each column:
# print(train.isna().sum())
# print(test.isna().sum())
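Since the CSV files are not reproduced here, the missing-value check from the commented-out lines can be tried on a small stand-in DataFrame. This is only a sketch: the rows and the id column are hypothetical, and only the ABSTRACT column name comes from this article.

```python
import pandas as pd

# Hypothetical stand-in for the training data; the values are invented.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "ABSTRACT": ["We study topic models.", None, "A short abstract."],
})

# isna() flags each missing cell; sum() then counts the flags per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

A non-zero count for a text column is worth knowing about before computing anything on that column.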
Exploratory Data Analysis (EDA) explores the data to find relationships between measures. It tells us that a relationship exists, but not what causes it, so its findings are best used to formulate hypotheses. In other words, the relationships EDA uncovers do not prove causation: correlation does not imply causation, as the saying goes. Let’s start by looking at the number of characters in each abstract:
# Length of each abstract in characters
train["Number of Characters"] = train["ABSTRACT"].apply(lambda x: len(str(x)))
test["Number of Characters"] = test["ABSTRACT"].apply(lambda x: len(str(x)))

# Histogram (left) and box plot (right) of abstract lengths in the training set
fig = make_subplots(rows=1, cols=2)
trace1 = go.Histogram(x=train["Number of Characters"])
fig.add_trace(trace1, row=1, col=1)
trace2 = go.Box(y=train["Number of Characters"])
fig.add_trace(trace2, row=1, col=2)
fig.update_layout(showlegend=False)
fig.show()
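One detail of the character count above is worth keeping in mind: because of the str(x) cast, a missing abstract is turned into the string "nan" and counted as 3 characters. Whether this dataset contains missing abstracts is not stated here, so the sketch below uses a hypothetical toy DataFrame to compare the cast with pandas' own .str.len(), which leaves missing values as missing:

```python
import pandas as pd

# Hypothetical toy data; the last abstract is missing, as read_csv would load it.
toy = pd.DataFrame(
    {"ABSTRACT": ["short text", "a somewhat longer abstract", float("nan")]}
)

# The approach used above: the missing value becomes the string "nan".
via_str_cast = toy["ABSTRACT"].apply(lambda x: len(str(x)))

# Alternative: .str.len() keeps the missing value as NaN.
via_str_len = toy["ABSTRACT"].str.len()

# describe() ignores NaN, so only the two real abstracts are summarised.
summary = via_str_len.describe()

print(list(via_str_cast))
print(summary)
```

Either approach gives the same histogram when there are no missing abstracts; they only differ in how missing rows are treated.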