Apologies for not providing a reprex (due to size); the code below uses the example from the textmineR package, so it should be reproducible.
Issue: the entry in model$summary for, say, topic 1 (t_1) does not seem to match the topic marked 1 in the LDAvis plot.
I believe the definitions of phi, P(token|topic), and theta, P(topic|document), are the same across textmineR and LDAvis, so I'd expect the same topic/word clusters under the same topic numbers.
Note that the issue was originally posted with textmineR (TommyJones/textmineR#72), and the author suggested that the reason may be with LDAvis.
library(textmineR)
# load nih_sample data set from textmineR
data(nih_sample)
# create a document term matrix
dtm <- CreateDtm(doc_vec = nih_sample$ABSTRACT_TEXT, # character vector of documents
                 doc_names = nih_sample$APPLICATION_ID, # document names
                 ngram_window = c(1, 2), # minimum and maximum n-gram length
                 stopword_vec = c(stopwords::stopwords("en"), # stopwords from tm
                                  stopwords::stopwords(source = "smart")), # this is the default value
                 lower = TRUE, # lowercase - this is the default value
                 remove_punctuation = TRUE, # punctuation - this is the default
                 remove_numbers = TRUE, # numbers - this is the default
                 verbose = FALSE, # Turn off status bar for this demo
                 cpus = 2) # default is all available cpus on the system
dtm <- dtm[,colSums(dtm) > 2]
set.seed(12345)
model <- FitLdaModel(dtm = dtm,
                     k = 20,
                     iterations = 200, # I usually recommend at least 500 iterations or more
                     burnin = 180,
                     alpha = 0.1,
                     beta = 0.05,
                     optimize_alpha = TRUE,
                     calc_likelihood = TRUE,
                     calc_coherence = TRUE,
                     calc_r2 = TRUE,
                     cpus = 2)
model$top_terms <- GetTopTerms(phi = model$phi, M = 10)
# Get the prevalence of each topic
# You can make this discrete by applying a threshold, say 0.05, for
# topics in/out of documents.
model$prevalence <- colSums(model$theta) / sum(model$theta) * 100
# textmineR has a naive topic labeling tool based on probable bigrams
model$labels <- LabelTopics(assignments = model$theta > 0.05,
                            dtm = dtm,
                            M = 1)
model$summary <- data.frame(topic = rownames(model$phi),
                            label = model$labels,
                            coherence = round(model$coherence, 3),
                            prevalence = round(model$prevalence, 3),
                            top_terms = apply(model$top_terms, 2, function(x) {
                              paste(x, collapse = ", ")
                            }),
                            stringsAsFactors = FALSE)
model$summary[ order(model$summary$prevalence, decreasing = TRUE) , ][ 1:10 , ]
# summary of document lengths
doc_lengths <- rowSums(dtm)
# get counts of tokens across the corpus
tf_mat <- TermDocFreq(dtm = dtm)
tf_mat
library(LDAvis)
# create the JSON object to feed the visualization:
json <- createJSON(
  phi = model$phi,
  theta = model$theta,
  doc.length = doc_lengths,
  vocab = tf_mat$term,
  term.frequency = tf_mat$term_freq
)
serVis(json, open.browser = TRUE)
Having played with @leungi's example, it looks like the row index on the phi matrix is shuffled in LDAvis compared to the row order of model$phi which is being fed into the JSON.
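A likely explanation for the shuffled rows, offered here as an assumption rather than something confirmed in this thread, is that createJSON() renumbers topics by decreasing token-weighted prevalence before plotting. The base-R sketch below imitates that reordering on made-up toy values (theta, doc_lengths, and the lookup table are all hypothetical stand-ins, not output from the model above) to show how a mapping from LDAvis topic numbers back to textmineR topic names could be built.

```r
# Toy stand-ins for model$theta and rowSums(dtm); the values are made up.
theta <- matrix(c(0.7, 0.2, 0.1,
                  0.1, 0.3, 0.6,
                  0.2, 0.2, 0.6),
                nrow = 3, byrow = TRUE,
                dimnames = list(paste0("d_", 1:3), paste0("t_", 1:3)))
doc_lengths <- c(100, 200, 50)

# Token-weighted topic frequency: each document's row of theta is scaled
# by that document's length, then summed per topic.
# (Assumption: this mirrors how LDAvis ranks topics by default.)
topic_freq <- colSums(theta * doc_lengths)

# Topics sorted from most to least prevalent.
new_order <- order(topic_freq, decreasing = TRUE)

# Lookup table: topic number shown in the plot -> original topic name.
lookup <- data.frame(ldavis_topic = seq_along(new_order),
                     textminer_topic = colnames(theta)[new_order],
                     stringsAsFactors = FALSE)
print(lookup)
```

With the real model, the same idea would use model$theta and doc_lengths in place of the toy inputs. Note that model$prevalence above is computed from unweighted column sums of theta, so its ranking need not agree with a length-weighted one. If the installed LDAvis version supports it, passing reorder.topics = FALSE to createJSON() should keep the topics in the order of model$phi.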