Discrepancy in topic content when summarizing and visualizing with LDAvis #97

leungi · 2019-05-02T21:34:12Z

Apologies for not providing a reprex (due to size), but the code below uses the example from the textmineR package, so it should be reproducible.

Issue: when reviewing model$summary for, say, topic 1 (t_1), the top terms don't match those of the topic marked 1 in the LDAvis plot.

I believe the definitions of phi P(token|topic) and theta P(topic|document) are the same across textmineR and LDAvis, so I'd expect similar topic/word clusters.
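To make that assumption checkable, here is a minimal base-R sketch of the sanity checks I have in mind. The toy phi, theta, and vocab values are hypothetical stand-ins for model$phi, model$theta, and tf_mat$term; the helper rows_sum_to_one is my own name, not part of either package.

```r
# Toy stand-ins for model$phi (topics x tokens) and model$theta (documents x topics).
# All values are hypothetical and only illustrate the checks.
phi <- matrix(c(0.7, 0.2, 0.1,
                0.1, 0.3, 0.6),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("t_1", "t_2"), c("cell", "gene", "patient")))
theta <- matrix(c(0.9, 0.1,
                  0.4, 0.6,
                  0.2, 0.8),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("d1", "d2", "d3"), c("t_1", "t_2")))
vocab <- c("cell", "gene", "patient")  # stand-in for tf_mat$term

# Hypothetical helper: TRUE if every row is a probability distribution.
rows_sum_to_one <- function(m, tol = 1e-8) all(abs(rowSums(m) - 1) < tol)

phi_ok         <- rows_sum_to_one(phi)    # each topic is a distribution over tokens
theta_ok       <- rows_sum_to_one(theta)  # each document is a distribution over topics
topics_aligned <- identical(rownames(phi), colnames(theta))  # same topic order
vocab_aligned  <- identical(colnames(phi), vocab)            # same token order
```

If all four checks come back TRUE on the real objects, the inputs are at least consistent with each other, which would point to the visualization step rather than the inputs.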

Note that the issue was originally posted with textmineR (TommyJones/textmineR#72), and the author suggested that the reason may be with LDAvis.

library(textmineR)

# load nih_sample data set from textmineR
data(nih_sample)

# create a document term matrix 
dtm <- CreateDtm(doc_vec = nih_sample$ABSTRACT_TEXT, # character vector of documents
                 doc_names = nih_sample$APPLICATION_ID, # document names
                 ngram_window = c(1, 2), # minimum and maximum n-gram length
                 stopword_vec = c(stopwords::stopwords("en"), # English stopwords from the stopwords package
                                  stopwords::stopwords(source = "smart")), # this is the default value
                 lower = TRUE, # lowercase - this is the default value
                 remove_punctuation = TRUE, # punctuation - this is the default
                 remove_numbers = TRUE, # numbers - this is the default
                 verbose = FALSE, # Turn off status bar for this demo
                 cpus = 2) # default is all available cpus on the system

dtm <- dtm[,colSums(dtm) > 2]

set.seed(12345)

model <- FitLdaModel(dtm = dtm, 
                     k = 20,
                     iterations = 200, # I usually recommend at least 500 iterations or more
                     burnin = 180,
                     alpha = 0.1,
                     beta = 0.05,
                     optimize_alpha = TRUE,
                     calc_likelihood = TRUE,
                     calc_coherence = TRUE,
                     calc_r2 = TRUE,
                     cpus = 2) 

model$top_terms <- GetTopTerms(phi = model$phi, M = 10)

# Get the prevalence of each topic
# You can make this discrete by applying a threshold, say 0.05, for
# topics in/out of documents. 
model$prevalence <- colSums(model$theta) / sum(model$theta) * 100

# textmineR has a naive topic labeling tool based on probable bigrams
model$labels <- LabelTopics(assignments = model$theta > 0.05, 
                            dtm = dtm,
                            M = 1)


model$summary <- data.frame(topic = rownames(model$phi),
                            label = model$labels,
                            coherence = round(model$coherence, 3),
                            prevalence = round(model$prevalence,3),
                            top_terms = apply(model$top_terms, 2, function(x){
                              paste(x, collapse = ", ")
                            }),
                            stringsAsFactors = FALSE)
model$summary[ order(model$summary$prevalence, decreasing = TRUE) , ][ 1:10 , ]



# summary of document lengths
doc_lengths <- rowSums(dtm)
# get counts of tokens across the corpus
tf_mat <- TermDocFreq(dtm = dtm)
tf_mat


library(LDAvis)
# create the JSON object to feed the visualization:
json <- createJSON(
  phi = model$phi,
  theta = model$theta,
  doc.length = doc_lengths,
  vocab = tf_mat$term,
  term.frequency = tf_mat$term_freq
)

serVis(json, open.browser = TRUE)


TommyJones · 2019-05-02T21:39:51Z

Having played with @leungi's example, it looks like the row index on the phi matrix is shuffled in LDAvis compared to the row order of model$phi which is being fed into the JSON.
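One possible explanation, offered as a sketch rather than a confirmed diagnosis: createJSON() has a reorder.topics argument that defaults to TRUE, which renumbers topics by decreasing corpus-wide proportion. The base-R snippet below imitates that ordering on toy data so the mapping between the plot's topic numbers and the original row names can be seen. The theta and doc_lengths values are hypothetical, and the weighting by document length is my reading of how the proportions are computed.

```r
# Toy stand-ins for model$theta (documents x topics) and rowSums(dtm).
# Values are hypothetical.
theta <- matrix(c(0.1, 0.2, 0.7,
                  0.2, 0.2, 0.6,
                  0.6, 0.1, 0.3),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("d1", "d2", "d3"), c("t_1", "t_2", "t_3")))
doc_lengths <- c(100, 50, 10)

# Expected token count per topic: each document's row is weighted by its length.
topic_frequency  <- colSums(theta * doc_lengths)
topic_proportion <- topic_frequency / sum(topic_frequency)

# Topics sorted from most to least prevalent; position in `o` is the
# number the topic would get after reordering.
o <- order(topic_proportion, decreasing = TRUE)

mapping <- data.frame(ldavis_topic   = seq_along(o),
                      original_topic = colnames(theta)[o],
                      proportion     = round(unname(topic_proportion[o]), 3),
                      stringsAsFactors = FALSE)
mapping
```

If this is indeed the cause, passing reorder.topics = FALSE to createJSON() should keep the topics in the row order of model$phi, so that topic 1 in the plot corresponds to t_1 in model$summary.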
