
Improvement of Viterbi algorithm used in tagger.cljc #28

Merged
merged 22 commits into from
Jan 23, 2018


attil-io
Contributor

@attil-io attil-io commented Jan 23, 2018

The Viterbi implementation in tagger.cljc keys the newly calculated probabilities by the value of each observation rather than by its index:


(assoc T1 [cur-state cur-observation] (*
; -----------------------------------^
                         (if-let [p  (get emissions
                                        [cur-state cur-observation    ])]
                                        p
                                        0)
                          (reduce max (vals A*T))))))

For example, if the sentence is

"Je mange une pomme"

then cur-observation takes the values je, mange, une, pomme in turn. This works as long as each observation appears only once in the sentence.

However, consider the following example:

"Je te montre ma montre"

In this sentence, "montre" appears both as a verb (the first occurrence) and as a noun (the second occurrence). Because both occurrences share the same key in T1, the current implementation tags "montre" either as a verb or as a noun; it cannot assign a different tag to each occurrence.
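The collision can be seen in isolation with a small sketch. The names `by-value` and `by-index` and the fixed state "V" are hypothetical and only illustrate how the two keying schemes behave; this is not the code from tagger.cljc.

```clojure
;; Hypothetical illustration: a table keyed by [state observation]
;; collapses repeated words into a single entry.
(def observations ["je" "te" "montre" "ma" "montre"])

;; Keyed by [state word]: the second "montre" overwrites the first.
(def by-value
  (reduce (fn [table [i word]]
            (assoc table ["V" word] i))
          {}
          (map-indexed vector observations)))

;; Keyed by [state position]: every occurrence keeps its own slot.
(def by-index
  (reduce (fn [table [i _]]
            (assoc table ["V" i] i))
          {}
          (map-indexed vector observations)))

(count by-value)              ; 4 entries for 5 words
(get by-value ["V" "montre"]) ; 4, the position of the last "montre"
(count by-index)              ; 5 entries, one per position
```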

The following test case reproduces the issue:
(is (= ["P" "P" "V" "P" "N"] (viterbi sample-model ["Je" "Te" "Montre" "Ma" "Montre"])))

(See tagger_test.cljc for sample-model.)

The proposed change uses the index of each observation, instead of the observation itself, as the key into T1.
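As a rough, self-contained sketch of the idea (not the merged code), the following Viterbi variant keys its table by [state index]. The function name `viterbi-by-index`, the model layout (`:states`, `:initial`, `:transitions`, `:emissions`) and the probabilities in `toy-model` are all assumptions made for illustration; they do not reflect sample-model or the library's actual API.

```clojure
(defn viterbi-by-index
  "Returns the most likely state sequence for observations.
  Assumed (hypothetical) model layout:
    :states      vector of state names
    :initial     map of state -> probability
    :transitions map of [from to] -> probability
    :emissions   map of [state observation] -> probability"
  [{:keys [states initial transitions emissions]} observations]
  (if (empty? observations)
    []
    (let [n    (count observations)
          emit (fn [s o] (get emissions [s o] 0))
          ;; T1 is keyed by [state index], never by [state word].
          init (into {}
                     (for [s states]
                       [[s 0] {:p    (* (get initial s 0)
                                        (emit s (nth observations 0)))
                               :prev nil}]))
          table (reduce
                 (fn [T1 i]
                   (into T1
                         (for [s states]
                           (let [score (fn [prev]
                                         (* (:p (T1 [prev (dec i)]))
                                            (get transitions [prev s] 0)))
                                 best  (apply max-key score states)]
                             [[s i] {:p    (* (score best)
                                              (emit s (nth observations i)))
                                     :prev best}]))))
                 init
                 (range 1 n))
          last-state (apply max-key #(:p (table [% (dec n)])) states)]
      ;; Walk the back-pointers from the last position to the first.
      (loop [i (dec n), s last-state, path ()]
        (if (neg? i)
          (vec path)
          (recur (dec i) (:prev (table [s i])) (conj path s)))))))

;; Made-up probabilities, chosen so that "montre" is a verb after "te"
;; and a noun after "ma".
(def toy-model
  {:states      ["P" "V" "N"]
   :initial     {"P" 1.0}
   :transitions {["P" "P"] 0.3, ["P" "V"] 0.3, ["P" "N"] 0.4,
                 ["V" "P"] 0.9, ["N" "P"] 0.1}
   :emissions   {["P" "je"] 0.3, ["P" "te"] 0.3, ["P" "ma"] 0.3,
                 ["V" "montre"] 0.5, ["N" "montre"] 0.5}})

(viterbi-by-index toy-model ["je" "te" "montre" "ma" "montre"])
;; => ["P" "P" "V" "P" "N"]
```

Because each position has its own entry, the two occurrences of "montre" keep separate probabilities and back-pointers, so they can receive different tags.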

@turbopape turbopape merged commit 3dbfbea into turbopape:master Jan 23, 2018
@turbopape
Owner

Thank you very much ! 👍
Are you using the lib or just playing around? I would love to know !

@attil-io
Contributor Author

Just playing around :)
In fact, I'm in the process of learning Clojure, and know very little about part-of-speech tagging in general.
