Latent Dirichlet Allocation (LDA)

Updated: Feb 1, 2021


What is topic modeling?

Topic modeling is a method for unsupervised classification of documents, similar to clustering on numeric data, which finds some natural groups of items (topics) even when we’re not sure what we’re looking for.


LDA MODEL:

In more detail, LDA represents documents as mixtures of topics that spit out words with certain probabilities. It assumes that documents are produced in the following fashion: when writing each document, you

  • Decide on the number of words N the document will have (say, according to a Poisson distribution).

  • Choose a topic mixture for the document (according to a Dirichlet distribution over a fixed set of K topics). For example, assuming that we have two topics, food and cute animals, you might choose the document to consist of 1/3 food and 2/3 cute animals.

  • Generate each word w_i in the document by:

    • First picking a topic (according to the multinomial distribution that you sampled above; for example, you might pick the food topic with 1/3 probability and the cute animals topic with 2/3 probability).

    • Using the topic to generate the word itself (according to the topic’s multinomial distribution). For example, if we selected the food topic, we might generate the word “apple” with a 30% probability, “mangoes” with 15% probability, and so on.

Assuming this generative model for a collection of documents, LDA then tries to backtrack from the documents to find a set of topics that are likely to have generated the collection.
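The generative story above can be sketched in a few lines of Python. This is only an illustration, not how LDA is fitted: the two topics, their word probabilities, the Dirichlet parameter alpha, and the mean document length are all made-up values, and the helper names are hypothetical.

```python
import math
import random

def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_poisson(lam, rng):
    """Draw a Poisson-distributed integer (Knuth's multiplication method)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

# Hypothetical topic-word distributions; each one sums to 1.
topics = {
    "food": {"apple": 0.30, "mangoes": 0.15, "bread": 0.55},
    "cute animals": {"kitten": 0.50, "puppy": 0.30, "panda": 0.20},
}

def generate_document(topics, alpha, mean_length, rng):
    names = list(topics)
    n_words = sample_poisson(mean_length, rng)   # number of words N
    mixture = sample_dirichlet(alpha, rng)       # topic mixture for this document
    words = []
    for _ in range(n_words):
        # Pick a topic according to the document's mixture ...
        topic = rng.choices(names, weights=mixture)[0]
        # ... then pick a word according to that topic's word distribution.
        dist = topics[topic]
        words.append(rng.choices(list(dist), weights=list(dist.values()))[0])
    return mixture, words

rng = random.Random(0)
mixture, words = generate_document(topics, alpha=[1.0, 1.0], mean_length=8, rng=rng)
print(mixture)
print(words)
```

Fitting LDA runs this story in reverse: only the words are observed, and the topic mixtures and topic-word distributions are inferred.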


Let's understand it through an implementation:

For this, the Google Colab platform is used.

from google.colab import drive
drive.mount('/content/drive')

O/p: Drive already mounted at /content/drive

In the first step, mount your Google Drive in Colab. As the output above shows, the drive was already mounted here.

loc = '/content/drive/My Drive/ldaa/lda/data.csv'

Give the location of the CSV file.


The next step is importing the libraries:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation as LDA
import numpy as np
import pandas as pd
import re
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')

Here NumPy, pandas, matplotlib, and seaborn are imported, along with CountVectorizer from sklearn.feature_extraction.text and LatentDirichletAllocation (aliased as LDA) from sklearn.decomposition.


Step 2:

df = pd.read_csv(loc)
print(df.head())
print(df.columns)

# Remove commas, periods, exclamation marks and question marks
df['processed'] = df['text'].map(lambda x: re.sub(r'[,.!?]', '', x))
# Convert the text to lowercase
df['processed'] = df['processed'].map(lambda x: x.lower())

print(df['processed'].head())

cv = CountVectorizer(stop_words='english')
# Fit and transform the processed text into a document-term count matrix
cv_data = cv.fit_transform(df['processed'])

def print_topics(model, count_vectorizer, n_top_words):
    # get_feature_names_out() needs scikit-learn 1.0+; older versions use get_feature_names()
    words = count_vectorizer.get_feature_names_out()
    for topic_idx, topic in enumerate(model.components_):
        print("\nTopic #%d:" % topic_idx)
        print(" ".join([words[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))

In this step, the first line reads the CSV file with pandas. The text is then preprocessed: punctuation is removed and the text is converted to lowercase. Next, CountVectorizer drops English stop words and encodes the text as a document-term count matrix. Finally, the print_topics helper prints the top n_top_words words of each topic of a fitted model.
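To see what the cleaning step does to a single string, here is a small standalone sketch. The helper name clean and the sample sentences are made up for illustration; the regular expression is the one used above.

```python
import re

def clean(text):
    """Remove commas, periods, exclamation and question marks, then lowercase."""
    return re.sub(r"[,.!?]", "", text).lower()

print(clean("Hello, World! Is this LDA?"))  # hello world is this lda
print(clean("It's 5:30; go!"))              # it's 5:30; go
```

Note that only four punctuation characters are removed; apostrophes, colons, and semicolons stay in the text.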

In the next step, we need the guidedlda library. This library is not pre-installed in Google Colab, so we first have to install it with the command below:

!pip install guidedlda

Then import it:

import guidedlda


top = ['glassdoor_reviews', 'tech_news', 'room_rentals', 'sports_news', 'automobiles']

The list top holds the names of the five topics we expect to find in the data.

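vocab = cv.get_feature_names_out()

Get the vocabulary as an array of words in column order and store it in vocab. (cv.vocabulary_ is a dictionary whose iteration order need not match the column order, so it is not enumerated here.)


word2id = dict((v, idx) for idx, v in enumerate(vocab))

Then build a dictionary, word2id, that maps each word to its column index.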
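A tiny sketch of the word-to-id mapping, using a made-up five-word vocabulary in place of the real one. The point is that each id must equal the word's column position in the count matrix.

```python
# Hypothetical vocabulary, listed in the column order of the count matrix.
vocab = ["apple", "car", "engine", "rent", "team"]

# Map each word to its column index.
word2id = dict((v, idx) for idx, v in enumerate(vocab))

print(word2id["engine"])  # 2
```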


model = guidedlda.GuidedLDA(n_topics=5, n_iter=100, random_state=7, refresh=20)

Then create a GuidedLDA model with n_topics=5, 100 iterations (n_iter=100), and random_state=7. With refresh=20, the log likelihood is logged every 20 iterations.

model.fit(cv_data)

INFO:guidedlda:n_documents: 6248
INFO:guidedlda:vocab_size: 22671
INFO:guidedlda:n_words: 228181
INFO:guidedlda:n_topics: 5
INFO:guidedlda:n_iter: 100
/usr/local/lib/python3.6/dist-packages/guidedlda/utils.py:55: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.
  if sparse and not np.issubdtype(doc_word.dtype, int):
INFO:guidedlda:<0> log likelihood: -2559963
INFO:guidedlda:<20> log likelihood: -1975941
INFO:guidedlda:<40> log likelihood: -1927156
INFO:guidedlda:<60> log likelihood: -1901394
INFO:guidedlda:<80> log likelihood: -1891037
INFO:guidedlda:<99> log likelihood: -1888564


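Next, fit the model again, this time passing seed_topics and seed_confidence. Note that guidedlda expects seed_topics to be a dictionary mapping word ids to topic ids, not a list of topic names. With the list top, no word id matches a seed, so the seeds appear to have no effect here, which is consistent with the log below being identical to the unseeded run.

model.fit(cv_data, seed_topics=top, seed_confidence=0.15)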

INFO:guidedlda:n_documents: 6248
INFO:guidedlda:vocab_size: 22671
INFO:guidedlda:n_words: 228181
INFO:guidedlda:n_topics: 5
INFO:guidedlda:n_iter: 100
/usr/local/lib/python3.6/dist-packages/guidedlda/utils.py:55: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.
  if sparse and not np.issubdtype(doc_word.dtype, int):
INFO:guidedlda:<0> log likelihood: -2559963
INFO:guidedlda:<20> log likelihood: -1975941
INFO:guidedlda:<40> log likelihood: -1927156
INFO:guidedlda:<60> log likelihood: -1901394
INFO:guidedlda:<80> log likelihood: -1891037
INFO:guidedlda:<99> log likelihood: -1888564
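For seeding to guide the topics, each topic needs a few seed words, and those words have to be converted to ids. The sketch below shows one way to build such a dictionary. The seed words and the small word2id mapping are hypothetical (the post does not list any seed words); in the notebook, word2id would come from the vectorizer's vocabulary.

```python
# Hypothetical word-to-id mapping standing in for the real vocabulary.
word2id = {
    "salary": 0, "manager": 1, "software": 2, "startup": 3, "rent": 4,
    "apartment": 5, "match": 6, "team": 7, "engine": 8, "car": 9,
}

# Hypothetical seed words, one list per topic, in the same order as `top`.
seed_topic_list = [
    ["salary", "manager"],        # topic 0: glassdoor_reviews
    ["software", "startup"],      # topic 1: tech_news
    ["rent", "apartment"],        # topic 2: room_rentals
    ["match", "team"],            # topic 3: sports_news
    ["engine", "car", "sedan"],   # topic 4: automobiles ("sedan" is not in the vocabulary)
]

# Build {word_id: topic_id}, skipping seed words missing from the vocabulary.
seed_topics = {}
for topic_id, seed_words in enumerate(seed_topic_list):
    for word in seed_words:
        if word in word2id:
            seed_topics[word2id[word]] = topic_id

print(seed_topics)

# The dictionary would then be passed to the model, for example:
# model.fit(cv_data, seed_topics=seed_topics, seed_confidence=0.15)
```

With seeds like these, topic 0 is nudged toward review-related words, topic 1 toward tech words, and so on, so the topic order lines up with the names in top.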


doc_topic = model.transform(cv_data)
doc_topic[0]

Here, transform returns the topic distribution for every document, and doc_topic[0] shows the distribution for the first document.

O/p: array([2.19152497e-01, 3.02792582e-03, 7.77457222e-01, 2.06843509e-04, 1.55511855e-04])


Let's check the number of rows in doc_topic; it matches the number of documents.

len(doc_topic)

O/p: 6248

doc_topic[20]

O/p: array([2.62679941e-01, 5.61569637e-04, 2.60673256e-03, 7.32870579e-01, 1.28117723e-03])


Let's create a data frame from doc_topic, using the topic names in top as column names.

proba = pd.DataFrame(doc_topic)
proba.columns = top
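Each row of doc_topic is a probability distribution over the five topics, so the most likely topic of a document is simply the position of the largest value. The sketch below applies this to the two rows printed earlier. The helper dominant_topic is hypothetical, and the names only carry meaning if the topic order really matches the order of top, which unseeded topics do not guarantee.

```python
top = ['glassdoor_reviews', 'tech_news', 'room_rentals', 'sports_news', 'automobiles']

# Rows copied from the outputs of doc_topic[0] and doc_topic[20] above.
doc_0 = [2.19152497e-01, 3.02792582e-03, 7.77457222e-01, 2.06843509e-04, 1.55511855e-04]
doc_20 = [2.62679941e-01, 5.61569637e-04, 2.60673256e-03, 7.32870579e-01, 1.28117723e-03]

def dominant_topic(probs, names):
    """Return the name and probability of the highest-probability topic."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return names[best], probs[best]

print(dominant_topic(doc_0, top)[0])   # room_rentals
print(dominant_topic(doc_20, top)[0])  # sports_news
```

On the proba data frame, proba.idxmax(axis=1) gives the same column name for every row at once.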

Then merge the proba data frame with the original data frame on the row index.

op = df.merge(proba, left_index=True, right_index=True)

Let's check the first few rows.

op.head()

Then Save the file to the drive.

op.to_csv('/content/drive/My Drive/ldaa/lda/op2.csv')

So in this way, we can build the LDA Model.


For the code, refer to this link:


Thank You For Reading!

Happy Learning.
