GSDMM Short Text Clustering via Dirichlet Mixture Models

License

MIT (LICENSE.md). The repository also contains a LICENSE file of unrecognized type.
till-tietz/gsdmm

gsdmm

gsdmm implements short text clustering via Dirichlet Mixture Models, as proposed by Yin and Wang (2014). It provides a fast C++ implementation of the Gibbs sampler described in the paper, together with an R interface. Specifically, gsdmm implements the likelihood function that allows for multiple occurrences of the same word in a given text (Equation 4 in the paper).
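To make the role of that likelihood concrete, here is a minimal sketch in plain R of the per-cluster score a collapsed Gibbs sampler of this kind evaluates for one text. It follows the conditional as it is usually stated for GSDMM, so check it against the paper before relying on it. The function `log_cluster_scores()` and all of its argument names are hypothetical; they are not part of the gsdmm package, whose sampler is written in C++.

```r
# Hypothetical helper for illustration only; NOT part of the gsdmm API.
# doc_counts: named vector of word counts for the text being reassigned
# m_z:        number of texts in each cluster, excluding this text
# n_z:        number of words in each cluster, excluding this text
# n_zw:       K x V matrix of per-cluster word counts (columns named by word)
log_cluster_scores <- function(doc_counts, m_z, n_z, n_zw, alpha, beta) {
  K <- length(m_z)           # upper bound on the number of clusters
  V <- ncol(n_zw)            # vocabulary size
  D <- sum(m_z) + 1          # total number of texts, including this one
  N_d <- sum(doc_counts)     # number of words in this text
  vapply(seq_len(K), function(z) {
    # how popular the cluster already is
    lp <- log(m_z[z] + alpha) - log(D - 1 + K * alpha)
    # word term: the j-th occurrence of a word adds (count + beta + j - 1),
    # which is what lets repeated words in one text count more than once
    for (w in names(doc_counts)) {
      j <- seq_len(doc_counts[[w]])
      lp <- lp + sum(log(n_zw[z, w] + beta + j - 1))
    }
    # normalising term over all words of the text
    i <- seq_len(N_d)
    lp - sum(log(n_z[z] + V * beta + i - 1))
  }, numeric(1))
}

# Toy state with two clusters: cluster 1 is about rockets, cluster 2 about cats.
vocab <- c("rocket", "cat", "mars")
n_zw <- matrix(c(3, 0,  0, 3,  1, 0), nrow = 2, dimnames = list(NULL, vocab))
scores <- log_cluster_scores(
  doc_counts = c(rocket = 2, mars = 1),
  m_z = c(2, 2), n_z = rowSums(n_zw), n_zw = n_zw,
  alpha = 0.1, beta = 0.2
)
# Normalise in log space to get sampling probabilities for the two clusters.
probs <- exp(scores - max(scores))
probs <- probs / sum(probs)
```

In this toy state the text that mentions "rocket" twice is far more likely to be drawn into the rocket cluster. Working in log space avoids numerical underflow when texts contain many words.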

Benefits:

  • efficient in both memory use and run time
  • unlike LDA, it requires only an upper bound on the number of clusters rather than the exact number

Development:

  • I am planning to add a tuning function for the alpha and beta parameters of the Gibbs sampler

Installation

You can install the development version of gsdmm from GitHub with:

# install.packages("devtools")
devtools::install_github("till-tietz/gsdmm")

Usage

Here is a minimal working example.

# lowercase, strip punctuation, lemmatize, tokenize and remove stop words,
# producing a list with one character vector of tokens per text
text <- c(
  "Rockets are amazing.",
  "Witnessing a rocket in flight is a marvel of engineering.",
  "We should take a rocket to Mars.",
  "Rocket",
  "Have you ever seen a cat?",
  "Cats are fun.",
  "Your cat seems sweet.",
  "Cat"
) |>
  tolower()  |>
  gsub(pattern = '[[:punct:] ]+', replacement = ' ') |>
  textstem::lemmatize_strings() |>
  text2vec::word_tokenizer() |>
  lapply(function(i) i[!i %in% stopwords::stopwords()])


gsdmm::gsdmm(texts = text, n_iter = 100, n_clust = 20, alpha = 0.1, beta = 0.2)
#> [1]  4  4  6  6 18 18  2 18
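The returned vector holds one cluster label per input text, in input order. As a small sketch of how that output might be inspected, the snippet below groups the original sentences by label using base R only. The labels are copied by hand from the output above so that the snippet runs without the package; `raw_text` and `clusters` are illustrative names. Because the Gibbs sampler is stochastic, a fresh run may produce different labels.

```r
# Original sentences, in the same order as they were passed to gsdmm().
raw_text <- c(
  "Rockets are amazing.",
  "Witnessing a rocket in flight is a marvel of engineering.",
  "We should take a rocket to Mars.",
  "Rocket",
  "Have you ever seen a cat?",
  "Cats are fun.",
  "Your cat seems sweet.",
  "Cat"
)

# Labels copied from the example output above (assumption: one label per text).
clusters <- c(4, 4, 6, 6, 18, 18, 2, 18)

# Group the sentences by their assigned cluster label.
by_cluster <- split(raw_text, clusters)

# Number of texts per non-empty cluster.
cluster_sizes <- table(clusters)

# Number of clusters actually used, out of the upper bound n_clust = 20.
n_used <- length(by_cluster)
```

Here only 4 of the 20 permitted clusters end up non-empty, which illustrates why n_clust needs to be only an upper bound. The label values themselves are arbitrary identifiers; only the grouping carries meaning.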