flashlighttext


This package provides bindings to part of Flashlight's Text C++ library: the beam search decoder, the built-in KenLM language model, and the dictionary components.

It is an R translation of the Python bindings library by the Flashlight group (see the flashlight-text repo).

Installation

From CRAN:

install.packages("flashlighttext")

You can install the development version of flashlighttext from GitHub with:

remotes::install_github("athospd/flashlighttext")

Examples

This text is adapted from this tutorial by Jacob Kahn.

library(flashlighttext)

Beam Search Decoder

Bindings for the lexicon and lexicon-free beam search decoders are supported for CTC/ASG models only (no seq2seq model support). Out-of-the-box language model support includes KenLM; users can also define a custom language model in R and use it for decoding; see the documentation below.
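In the Python bindings, a custom language model is defined by subclassing the LM interface and implementing start, score, and finish. As a purely illustrative sketch, assuming the R bindings mirror that interface with R6 classes named LM and LMState (an assumption; consult the package documentation), a zero-scoring LM might look like:

# HYPOTHETICAL sketch: LM, LMState, and state$child() are assumed to mirror the Python API
ZeroLM <- R6::R6Class("ZeroLM",
  inherit = LM,
  public = list(
    start = function(start_with_nothing) LMState$new(), # initial state of the LM
    score = function(state, usr_token_idx) {
      # returns list(new_state, score); this toy LM always scores 0
      list(state$child(usr_token_idx), 0)
    },
    finish = function(state) list(state$child(-1), 0) # end-of-sentence transition
  )
)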

To run a decoder, one should first define the decoder options:

# for the lexicon-based decoder
options <- LexiconDecoderOptions$new(
  beam_size,       # number of top hypotheses to preserve at each decoding step
  token_beam_size, # restrict the number of tokens by top AM scores (useful for a huge token set)
  beam_threshold,  # preserve a hypothesis only if its score is not far from the current best score
  lm_weight,       # language model weight for the LM score
  word_score,      # score for a word appearing in the transcription
  unk_score,       # score for an unknown word appearing in the transcription
  sil_score,       # score for silence appearing in the transcription
  log_add,         # how to combine scores when merging hypotheses (log-add or max)
  criterion_type   # supports only CriterionTypes$ASG or CriterionTypes$CTC
)

# for the lexicon-free decoder
options <- LexiconFreeDecoderOptions$new(
  beam_size,       # number of top hypotheses to preserve at each decoding step
  token_beam_size, # restrict the number of tokens by top AM scores (useful for a huge token set)
  beam_threshold,  # preserve a hypothesis only if its score is not far from the current best score
  lm_weight,       # language model weight for the LM score
  sil_score,       # score for silence appearing in the transcription
  log_add,         # how to combine scores when merging hypotheses (log-add or max)
  criterion_type   # supports only CriterionTypes$ASG or CriterionTypes$CTC
)
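As a concrete sketch, here is the lexicon-based constructor filled positionally with placeholder values; the numbers are assumptions for illustration, not tuned recommendations:

options <- LexiconDecoderOptions$new(
  100,                # beam_size
  100,                # token_beam_size
  25,                 # beam_threshold
  2.0,                # lm_weight
  -0.3,               # word_score
  -Inf,               # unk_score
  0,                  # sil_score
  FALSE,              # log_add: use max instead of log-add when merging
  CriterionTypes$CTC  # criterion_type
)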

Now, prepare a tokens dictionary (the tokens for which the model returns a probability at each frame) and a lexicon (a mapping between words and their spellings within the token set).

For further details on tokens and lexicon file formats, see the Data Preparation documentation in Flashlight.
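As a quick illustration of the two formats (the contents below are made up; the Flashlight docs are authoritative), a tokens file lists one token per line, and each lexicon line is a word followed by its space-separated spelling:

tokens.txt (one token per line):

|
a
b

words.txt (a word followed by its space-separated spelling):

handsets h a n d s e t s |
primus p r i m u s |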

tokens_dict <- Dictionary$new("path/tokens.txt")
tokens_dict$add_entry("<1>")
# for ASG, add the repetition symbols in use, for example:
# tokens_dict$add_entry("1")
# tokens_dict$add_entry("2")

lexicon <- load_words("words.txt") # returns a list
lexicon[1:2]
#> $handsets
#> $handsets[[1]]
#> [1] "h" "a" "n" "d" "s" "e" "t" "s" "|"
#>
#> $primus
#> $primus[[1]]
#> [1] "p" "r" "i" "m" "u" "s" "|"

word_dict <- create_word_dict(lexicon) # returns a Dictionary
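As a quick sanity check, the resulting Dictionary can be queried with the same methods used later in this README (the returned values are illustrative):

word_dict$index_size()          # total number of words, including <unk>
word_dict$get_index("handsets") # integer index of a known word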

To create a KenLM language model, use:

lm <- KenLM$new("path/lm.arpa", word_dict) # or "path/lm.bin"
#> Loading the LM will be faster if you build a binary file.
#> Reading C:/Users/ap_da/AppData/Local/R/win-library/4.3/flashlighttext/lm.arpa
#> ----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
#> ****************************************************************************************************
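The message above refers to KenLM's binary format: the build_binary tool that ships with KenLM (run outside of R, e.g. build_binary lm.arpa lm.bin) converts an ARPA file into a binary one that loads faster.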

Get the unknown and silence token indices from the token and word dictionaries to pass to the decoder:

sil_idx <- tokens_dict$get_index("|")    # 0
unk_idx <- word_dict$get_index("<unk>")  # 21207

Now, define the lexicon Trie to restrict the beam search decoder's search space:

# build the trie
library(magrittr) # provides the %>% pipe used below

trie <- Trie$new(tokens_dict$index_size(), sil_idx)
start_state <- lm$start(FALSE)
lexicon <- list2env(lexicon, hash = TRUE) # environment for fast word lookup
for(word in names(lexicon)) {
  spellings <- lexicon[[word]]
  usr_idx <- word_dict$get_index(word)
  score <- lm$score(start_state, usr_idx)[[2]] # LM score for the word
  for(spelling in spellings) {
    # convert the spelling string into a vector of token indices
    tokens_dict$map_entries_to_indices(spelling) %>%
      pack_replabels(tokens_dict, 1) %>%
      trie$insert(usr_idx, score)
  }
}

# propagate the word score to each spelling node so every node has an LM proxy score
trie$smear(SmearingModes$MAX)
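SmearingModes$MAX propagates the maximum child score into each node. Flashlight's C++ decoder also defines NONE and LOGADD smearing modes; assuming the R enum mirrors them (an assumption worth checking in the package):

# trie$smear(SmearingModes$NONE)   # no smearing
# trie$smear(SmearingModes$LOGADD) # combine child scores with log-add instead of max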

Finally, we can run the lexicon-based decoder: