flashlighttext


This package provides bindings to part of Flashlight's Text C++ library: the beam search decoder, the built-in KenLM language model, and the dictionary components.

It is an R translation of the Python bindings library by the Flashlight group (see the flashlight-text repo).

Installation

From CRAN:

install.packages("flashlighttext")

You can install the development version of flashlighttext from GitHub with:

remotes::install_github("athospd/flashlighttext")

Examples

This text is adapted from this tutorial by Jacob Kahn.

library(flashlighttext)

Beam Search Decoder

Bindings for the lexicon and lexicon-free beam search decoders are supported for CTC/ASG models only (no seq2seq model support). Out-of-the-box language model support includes KenLM; users can also define a custom language model in R and use it for decoding; see the documentation below.
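In the Python bindings, a custom language model is defined by subclassing the LM interface and implementing start, score, and finish. As a purely illustrative sketch, assuming the R bindings mirror that interface with R6 classes named LM and LMState (an assumption; consult the package documentation), a zero-scoring LM might look like:

# HYPOTHETICAL sketch: LM, LMState, and state$child() are assumed to mirror the Python API
ZeroLM <- R6::R6Class("ZeroLM",
  inherit = LM,
  public = list(
    start = function(start_with_nothing) LMState$new(), # initial state of the LM
    score = function(state, usr_token_idx) {
      # returns list(new_state, score); this toy LM always scores 0
      list(state$child(usr_token_idx), 0)
    },
    finish = function(state) list(state$child(-1), 0) # end-of-sentence transition
  )
)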

To run a decoder, one should first define the decoder options:

# for the lexicon-based decoder
options <- LexiconDecoderOptions$new(
  beam_size,       # number of top hypotheses to preserve at each decoding step
  token_beam_size, # restrict the number of tokens by top AM scores (useful for a huge token set)
  beam_threshold,  # preserve a hypothesis only if its score is not far from the current best score
  lm_weight,       # language model weight for the LM score
  word_score,      # score for a word appearing in the transcription
  unk_score,       # score for an unknown word appearing in the transcription
  sil_score,       # score for silence appearing in the transcription
  log_add,         # how to combine scores when merging hypotheses (log-add or max)
  criterion_type   # supports only CriterionTypes$ASG or CriterionTypes$CTC
)

# for the lexicon-free decoder
options <- LexiconFreeDecoderOptions$new(
  beam_size,       # number of top hypotheses to preserve at each decoding step
  token_beam_size, # restrict the number of tokens by top AM scores (useful for a huge token set)
  beam_threshold,  # preserve a hypothesis only if its score is not far from the current best score
  lm_weight,       # language model weight for the LM score
  sil_score,       # score for silence appearing in the transcription
  log_add,         # how to combine scores when merging hypotheses (log-add or max)
  criterion_type   # supports only CriterionTypes$ASG or CriterionTypes$CTC
)
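As a concrete sketch, here is the lexicon-based constructor filled positionally with placeholder values; the numbers are assumptions for illustration, not tuned recommendations:

options <- LexiconDecoderOptions$new(
  100,                # beam_size
  100,                # token_beam_size
  25,                 # beam_threshold
  2.0,                # lm_weight
  -0.3,               # word_score
  -Inf,               # unk_score
  0,                  # sil_score
  FALSE,              # log_add: use max instead of log-add when merging
  CriterionTypes$CTC  # criterion_type
)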

Now, prepare a tokens dictionary (the tokens for which the model returns a probability at each frame) and a lexicon (a mapping between words and their spellings within the token set).

For further details on tokens and lexicon file formats, see the Data Preparation documentation in Flashlight.
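As a quick illustration of the two formats (the contents below are made up; the Flashlight docs are authoritative), a tokens file lists one token per line, and each lexicon line is a word followed by its space-separated spelling:

tokens.txt (one token per line):

|
a
b

words.txt (a word followed by its space-separated spelling):

handsets h a n d s e t s |
primus p r i m u s |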

tokens_dict <- Dictionary$new("path/tokens.txt")
tokens_dict$add_entry("<1>")
# for ASG, add the repetition symbols in use, for example:
# tokens_dict$add_entry("1")
# tokens_dict$add_entry("2")

lexicon <- load_words("words.txt") # returns a list
lexicon[1:2]
#> $handsets
#> $handsets[[1]]
#> [1] "h" "a" "n" "d" "s" "e" "t" "s" "|"
#>
#> $primus
#> $primus[[1]]
#> [1] "p" "r" "i" "m" "u" "s" "|"

word_dict <- create_word_dict(lexicon) # returns a Dictionary
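As a quick sanity check, the resulting Dictionary can be queried with the same methods used later in this README (the returned values are illustrative):

word_dict$index_size()          # total number of words, including <unk>
word_dict$get_index("handsets") # integer index of a known word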

To create a KenLM language model, use:

lm <- KenLM$new("path/lm.arpa", word_dict) # or "path/lm.bin"
#> Loading the LM will be faster if you build a binary file.
#> Reading C:/Users/ap_da/AppData/Local/R/win-library/4.3/flashlighttext/lm.arpa
#> ----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
#> ****************************************************************************************************
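The message above refers to KenLM's binary format: the build_binary tool that ships with KenLM (run outside of R, e.g. build_binary lm.arpa lm.bin) converts an ARPA file into a binary one that loads faster.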

Get the unknown and silence token indices from the token and word dictionaries to pass to the decoder:

sil_idx <- tokens_dict$get_index("|")    # 0
unk_idx <- word_dict$get_index("<unk>")  # 21207

Now, define the lexicon Trie to restrict the beam search decoder's search space:

# build the trie
library(magrittr) # provides the %>% pipe used below

trie <- Trie$new(tokens_dict$index_size(), sil_idx)
start_state <- lm$start(FALSE)
lexicon <- list2env(lexicon, hash = TRUE) # environment for fast word lookup
for(word in names(lexicon)) {
  spellings <- lexicon[[word]]
  usr_idx <- word_dict$get_index(word)
  score <- lm$score(start_state, usr_idx)[[2]] # LM score for the word
  for(spelling in spellings) {
    # convert the spelling string into a vector of token indices
    tokens_dict$map_entries_to_indices(spelling) %>%
      pack_replabels(tokens_dict, 1) %>%
      trie$insert(usr_idx, score)
  }
}

# propagate the word score to each spelling node so every node has an LM proxy score
trie$smear(SmearingModes$MAX)
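SmearingModes$MAX propagates the maximum child score into each node. Flashlight's C++ decoder also defines NONE and LOGADD smearing modes; assuming the R enum mirrors them (an assumption worth checking in the package):

# trie$smear(SmearingModes$NONE)   # no smearing
# trie$smear(SmearingModes$LOGADD) # combine child scores with log-add instead of max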

Finally, we can run the lexicon-based decoder: