Improvement of Viterbi algorithm used in tagger.cljc #28

attil-io · 2018-01-23T21:57:13Z

The Viterbi algorithm implementation in tagger.cljc uses the values of the observations, instead of their index, to store the newly calculated probabilities:


(assoc T1 [cur-state cur-observation]
; -------------------^
       (* (if-let [p (get emissions [cur-state cur-observation])]
            p
            0)
          (reduce max (vals A*T))))))

E.g., if the sentence is "Je mange une pomme", then cur-observation will take the values je, mange, une, pomme. This is fine as long as each observation appears only once in the sentence.

However, consider the following example: "Je te montre ma montre". In this sentence, "montre" appears both as a verb (the first occurrence) and as a noun (the second occurrence). Because T1 is keyed by the word itself, both occurrences share the same entries, so the current implementation would tag both as verb or both as noun; it cannot give each occurrence its own tag.
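The key collision can be seen in isolation with a small sketch. This is not the code from tagger.cljc; `by-value`, `by-index` and the `:score` placeholder are made up for illustration, and only a single state "V" is used to keep it short.

```clojure
;; Hypothetical illustration, not the library's code.
(def sentence ["je" "te" "montre" "ma" "montre"])

;; Keys built from the word itself: both occurrences of "montre"
;; produce the same key ["V" "montre"], so one entry overwrites the other.
(def by-value
  (into {} (map (fn [w] [["V" w] :score]) sentence)))

;; Keys built from the position: every observation keeps its own entry.
(def by-index
  (into {} (map-indexed (fn [i _] [["V" i] :score]) sentence)))

(count by-value) ; => 4
(count by-index) ; => 5
```

With five observations, the value-keyed map ends up with only four entries, which is why the two occurrences of "montre" cannot carry different probabilities.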

The following test case reproduces the issue:
(is (= ["P" "P" "V" "P" "N"] (viterbi sample-model ["Je" "Te" "Montre" "Ma" "Montre"])))

(See tagger_test.cljc for sample-model.)

The proposed change uses the index of each observation, instead of the observation itself, to address T1.
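For readers who want to see the idea end to end, here is a minimal sketch of a Viterbi pass whose tables are addressed by [state index]. It is not the implementation merged in this PR: the function name `viterbi-by-index`, the model layout (`:states`, `:initial`, `:transitions`, `:emissions`), the lowercase words and all probabilities in `toy-model` are assumptions chosen for illustration, and they differ from sample-model in tagger_test.cljc.

```clojure
;; Hypothetical sketch, not the code from tagger.cljc.
(defn viterbi-by-index
  "Returns the most likely state sequence for `observations` as a vector.
  T1 (best path probability) and T2 (best predecessor) are both keyed
  by [state index], so repeated words get independent entries."
  [{:keys [states initial transitions emissions]} observations]
  (if (empty? observations)
    []
    (let [n    (count observations)
          emit (fn [s i] (get emissions [s (nth observations i)] 0))
          ;; First column: initial probability times emission.
          t1-0 (into {} (for [s states]
                          [[s 0] (* (get initial s 0) (emit s 0))]))
          [t1 t2]
          (reduce
            (fn [tables i]
              (reduce
                (fn [[t1 t2] s]
                  (let [scored (for [r states]
                                 [r (* (get t1 [r (dec i)])
                                       (get transitions [r s] 0))])
                        [best-r best-p] (apply max-key second scored)]
                    [(assoc t1 [s i] (* best-p (emit s i)))
                     (assoc t2 [s i] best-r)]))
                tables
                states))
            [t1-0 {}]
            (range 1 n))
          last-state (apply max-key #(get t1 [% (dec n)]) states)]
      ;; Follow the back-pointers from the last position to the first.
      (loop [i (dec n), s last-state, path (list last-state)]
        (if (zero? i)
          (vec path)
          (let [r (get t2 [s i])]
            (recur (dec i) r (conj path r))))))))

;; Made-up toy model: P = pronoun/determiner, V = verb, N = noun.
(def toy-model
  {:states      ["P" "V" "N"]
   :initial     {"P" 1.0}
   :transitions {["P" "P"] 0.4, ["P" "V"] 0.25, ["P" "N"] 0.35
                 ["V" "P"] 0.9, ["N" "P"] 0.3,  ["N" "V"] 0.7}
   :emissions   {["P" "je"] 0.4, ["P" "te"] 0.3, ["P" "ma"] 0.3
                 ["V" "montre"] 1.0, ["N" "montre"] 1.0}})

(viterbi-by-index toy-model ["je" "te" "montre" "ma" "montre"])
;; => ["P" "P" "V" "P" "N"]
```

In this toy model the first "montre" is tagged as a verb because a verb is far more likely than a noun to be followed by "P", while the last "montre" is tagged as a noun because "P" to "N" is the more likely transition. Both decisions coexist only because the two occurrences live under different indices.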

ALL TESTS PASS

turbopape · 2018-01-23T22:26:44Z

Thank you very much! 👍
Are you using the lib or just playing around? I would love to know!

attil-io · 2018-01-24T20:06:06Z

Just playing around :)
In fact, I'm in the process of learning Clojure, and know very little about part-of-speech tagging in general.

attil-io added 22 commits November 19, 2017 17:25

Forward declaration of fast-forward.

ec3f261

Correct test case name.

da40052

Merge remote-tracking branch 'upstream/master'

ea88d75

Ignore vim swap files.

dd092a1

Add test suite for utils.

19322d9

Assertions for bigrams.

00c7596

Assertions for similarity.

1427672

Tests for are-close-within

c034427

Assertions for find-first

633e456

Assertions for get-column

5bc0e3a

Assertions for get-row

9645aba

Correct documentation

80a1277

Test for arg-max.

81e81d6

arg-max returns nil on empty collection (instead of throwing exception)

fcf8022

Merge remote-tracking branch 'upstream/master'

9ea3b9a

Unit test for build-trie.

3f727ce

Unit tests for completions.

8f0f016

Ok test for Viterbi.

c9050a8

Test case with homonyms. (Fails)

5ff188e

Add one transition for state 'N'.

1b0dd2f

Change Viterbi implementation to make it pass the tests.

06cf63a

tagger returns vector instead of list

cfed23f

ALL TESTS PASS

turbopape merged commit 3dbfbea into turbopape:master Jan 23, 2018
