many repetitions & duplicates in translation #49

Open
kindziora opened this issue Oct 13, 2021 · 3 comments

@kindziora

Hi Guys,

Problem:
Regardless of the model I use, there are situations where the translation is broken and contains many repetitions.

One Example:

echo "30.4 C\nYaoundé\n \nLUNDI, 4 OCTOBRE 2021 11:46\nAFRIQUE CENTRALE\n \nAFRIQUE DE L’OUEST\n \nTÉLÉCOMS\n \nINNOVATION\n \nINTERNET\n \nENTRETIENS\n \nFRANÇAIS\nMORE\nAFRIQUE CENTRALE\n \nAFRIQUE DE L’OUEST\n \nTÉLÉCOMS\n \nINNOVATION\n" | ./opusMT-client.py -H localhost -s fr -t en

marian-opus-fr-en arguments

--alignment -p 11002 -b2 -n1 -m /usr/local/share/opusMT/models/fr-en/opus.npz -v /usr/local/share/opusMT/models/fr-en/opus.vocab.yml /usr/local/share/opusMT/models/fr-en/opus.vocab.yml

opusMT-opus-fr-en arguments

-p 20012 -c /var/cache/opusMT/opus.fr-en.cache.db --spm /usr/local/share/opusMT/models/fr-en/opus.fr.spm --mtport 11002 -s fr -t en

Result:

 { 
    "alignment": [
        "0-0 1-1 2-2 3-3 4-4 5-5 6-6 7-7 8-8 9-9 10-10 11-11 12-12 13-13 14-14 15-15 16-16 17-17 19-18 20-19 21-20 22-21 23-22 24-23 25-24",
        "0-0 2-105 4-3 7-1 8-2 10-4 10-184 10-194 11-5 11-45 11-55 11-60 11-65 11-110 11-115 11-120 11-125 11-130 11-135 11-140 11-145 12-6 12-81 12-86 12-91 12-141 12-151 12-156 12-161 12-166 12-231 12-236 12-241 13-50 13-70 13-75 13-80 13-85 13-90 13-95 13-100 13-150 13-155 13-160 13-165 13-170 13-175 13-180 13-185 13-190 13-195 13-200 13-205 13-210 13-215 13-220 13-225 13-230 13-235 13-240 13-245 13-250 13-255 13-260 13-265 15-8 15-13 15-83 15-88 15-138 15-143 15-148 15-153 15-158 15-163 15-168 15-173 15-178 15-203 15-208 15-213 15-218 15-223 15-228 15-233 15-238 15-243 15-248 15-253 15-258 15-263 19-259 20-261 21-7 21-72 21-77 21-82 21-227 21-232 21-237 21-242 21-247 22-9 22-14 22-84 22-89 22-94 22-99 22-139 22-144 22-149 22-154 22-159 22-164 22-169 22-174 22-179 22-219 22-224 22-229 22-234 22-239 22-244 22-249 22-254 22-264 23-10 24-11 24-16 24-251 24-256 32-15 47-96 47-171 47-176 57-47 57-52 57-62 57-67 67-18 67-43 67-48 67-53 67-58 67-63 67-68 67-73 67-78 67-93 67-98 67-103 67-108 67-113 67-118 67-123 67-128 67-133 67-183 67-188 67-193 67-198 70-12 70-252 70-257 70-262 73-19 74-20 76-25 78-23 78-28 83-21 83-266 84-17 84-22 84-27 84-57 84-87 84-92 84-97 84-102 84-107 84-112 84-117 84-122 84-127 84-132 84-137 84-142 84-147 84-152 84-157 84-162 84-167 84-172 84-177 84-182 84-187 84-192 84-197 84-202 84-207 84-212 84-217 84-222 85-24 85-104 85-109 85-114 85-119 85-189 85-199 86-30 87-26 87-31 87-36 87-41 87-46 87-51 87-56 87-61 87-66 87-71 87-101 87-106 87-111 87-116 87-121 87-126 87-131 87-136 87-146 87-181 87-186 87-191 87-196 87-201 87-206 87-211 87-216 89-33 89-38 93-32 93-37 93-42 94-29 94-34 94-39 94-44 94-49 94-54 94-59 94-64 95-35 95-40 96-76 96-221 96-226 96-246 102-69 102-74 102-79 102-124 102-129 102-134 102-204 102-209 102-214"
    ],
    "result": "30.4 C\\nYaound\u00e9\\n \\nLUNDI, 4 OCTOBER 2021 11: CENTRAL AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\n WEST AFRICA\\NLAND",
    "segmentation": "spm",
    "server": "localhost:20012",
    "source": "fr",
    "source-segments": [
        "\u258130 .4 \u2581C \\ n Y a ound \u00e9 \\ n \u2581\\ n L UND I , \u25814 \u2581 OC TO BRE \u258120 21 \u258111 :",
        "\u258146 \\ n A FR IQUE \u2581C ENT R ALE \\ n \u2581\\ n A FR IQUE \u2581DE \u2581L ' OU EST \\ n \u2581\\ n T\u00c9 L \u00c9 COM S \\ n \u2581\\ n IN NO V ATION \\ n \u2581\\ n INTER NET \\ n \u2581\\ n ENT RET IENS \\ n \u2581\\ n FR AN \u00c7 AIS \\ n M ORE \\ n A FR IQUE \u2581C ENT R ALE \\ n \u2581\\ n A FR IQUE \u2581DE \u2581L ' OU EST \\ n \u2581\\ n T\u00c9 L \u00c9 COM S \\ n \u2581\\ n IN NO V ATION \\ n"
    ],
    "target": "en",
    "target-segments": [
        "\u258130 .4 \u2581C \\ n Y a ound \u00e9 \\ n \u2581\\ n LU ND I , \u25814 \u2581OCT O BER \u258120 21 \u258111 :",
        "\u2581C ENT RAL \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ n \u2581W EST \u2581AFRICA \\ N LAND"
    ]
}

As you can see, the result repeats "WEST AFRICA" many times.

Question:
Does anybody have an idea why this happens?
Could it be related to marian-decoder or sentencepiece?

Kind Regards

Alex
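
To tell this kind of broken output apart from a normal translation automatically, a simple repetition check can help. The following is only a minimal sketch and not part of OPUS-MT: the function names, the `min_count` and `max_ratio` thresholds, and the choice of word bigrams are all assumptions made for illustration. Because the `result` above contains the literal two-character sequence `\n`, the sketch treats it as a word separator.

```python
from collections import Counter


def most_repeated_ngram(text, n=2):
    """Return (ngram, count) for the most frequent word n-gram, or (None, 0)."""
    # The output in this thread contains a literal backslash followed by "n",
    # so treat that sequence as whitespace before splitting into words.
    words = text.replace("\\n", " ").split()
    ngrams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return None, 0
    return Counter(ngrams).most_common(1)[0]


def looks_degenerate(text, n=2, min_count=3, max_ratio=0.3):
    """Heuristic: one n-gram occurs at least min_count times and makes up
    more than max_ratio of all n-grams. Thresholds are illustrative only."""
    words = text.replace("\\n", " ").split()
    total = len(words) - n + 1
    if total <= 0:
        return False
    _, count = most_repeated_ngram(text, n)
    return count >= min_count and count / total > max_ratio
```

A caller could run `looks_degenerate(response["result"])` on each server response and log or retry the flagged inputs. The thresholds would need tuning on real data.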

@jorgtied
Member

jorgtied commented Nov 4, 2021

I think I commented on this already somewhere else: Could you try without the all-capitals text? That might work better. I believe the training data contains too few examples written entirely in capital letters.
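
One way to try this suggestion without lowercasing everything is to rewrite only the lines that are entirely upper case. The sketch below is an assumption about how such preprocessing could look and is not something the OPUS-MT tools provide. The helper name `soften_all_caps` is hypothetical, and whether sentence case actually improves the translation is untested here.

```python
def soften_all_caps(line):
    """Convert a fully upper-case line to sentence case.

    Lines with mixed case are returned unchanged. Lines with fewer than two
    letters (for example a unit such as "C") are also left alone.
    """
    letters = [ch for ch in line if ch.isalpha()]
    if len(letters) >= 2 and all(ch.isupper() for ch in letters):
        # str.capitalize() upper-cases the first character and lower-cases the rest.
        return line.capitalize()
    return line
```

For example, `soften_all_caps("AFRIQUE DE L’OUEST")` gives `"Afrique de l’ouest"`, while `"Yaoundé"` and `"30.4 C"` pass through unchanged.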

@kindziora
Author

Thank you for your reply (it's a re-post from Helsinki-NLP/OPUS-MT-train#62).
Now it removes all the text and outputs only line breaks. :( It might be because of special characters; I will investigate this further.

Kind regards

Alex

echo "30.4 c\nyaoundé\n \nlundi, 4 octobre 2021 11:46\nafrique centrale\n \nafrique de l’ouest\n \ntélécoms\n \ninnovation\n \ninternet\n \nentretiens\n \nfrançais\nmore\nafrique centrale\n \nafrique de l’ouest\n \ntélécoms\n \ninnovation\n" | ./opusMT-client.py -H localhost -s fr -t en

{
    "alignment": [
        "0-0 1-1 2-2 3-3 4-4 5-5 6-6 7-7 8-8 9-9 10-10 11-11 12-12 13-13 14-14 15-15 16-16 17-17 18-18 19-19 20-20",
        "0-0 1-1 2-2 5-3 7-4 7-6 7-8 7-10 7-12 7-14 8-5 8-7 8-9 8-11 8-13 8-15 16-17 16-19 17-16 17-18 17-20 31-22 31-24 31-26 31-28 36-30 40-32 40-34 40-36 41-33 41-35 41-37 41-39 44-41 45-38 45-40 45-42 45-44 45-46 45-48 45-50 45-52 47-43 47-45 47-47 47-49 47-51 47-53 47-55 47-57 47-59 47-61 47-63 47-65 47-67 47-87 47-89 47-91 47-93 47-95 47-97 47-99 47-101 47-103 47-105 47-107 47-109 53-54 53-56 53-58 53-60 53-62 53-64 53-66 53-68 53-70 53-72 53-74 53-76 53-78 53-80 53-82 53-84 53-86 53-88 53-90 53-92 53-94 53-96 53-98 53-100 53-102 53-104 53-106 53-108 53-110 53-112 53-114 53-116 53-118 53-120 60-21 60-23 60-25 60-27 60-29 60-31 60-147 62-111 62-113 62-115 62-117 62-119 62-121 62-123 62-143 62-145 62-159 62-179 62-181 62-183 62-185 62-187 62-189 70-140 70-142 70-144 70-146 70-148 70-150 70-152 70-156 70-158 70-160 70-162 70-164 70-166 70-168 70-170 70-172 70-174 70-176 70-178 70-180 70-182 70-184 70-186 70-188 70-190 70-192 70-194 70-196 70-198 70-200 70-212 70-214 70-216 70-220 70-222 70-224 70-226 70-230 71-139 71-141 74-69 74-71 74-73 74-75 74-77 74-79 74-81 74-83 74-85 74-125 74-127 74-129 74-131 74-133 74-135 74-137 74-149 74-151 74-153 74-155 74-157 74-161 74-163 74-165 74-167 74-169 74-171 74-173 74-175 74-177 74-191 74-193 74-195 74-197 74-199 74-201 74-203 74-205 74-207 74-209 74-211 74-213 74-215 74-217 74-219 74-221 74-223 74-225 74-227 74-229 75-122 75-124 75-126 75-128 75-130 75-132 75-134 75-136 75-138 75-154 75-202 75-204 75-206 75-208 75-210 75-218 75-228"
    ],
    "result": "30.4 c\\nyaound\u00e9\\n \\nlundi, 4 October 2021 11: 46\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n",
    "segmentation": "spm",
    "server": "localhost:20012",
    "source": "fr",
    "source-segments": [
        "\u258130 .4 \u2581c \\ ny a ound \u00e9 \\ n \u2581\\ n lu ndi , \u25814 \u2581octobre \u258120 21 \u258111 :",
        "\u258146 \\ na f rique \u2581centrale \\ n \u2581\\ na f rique \u2581de \u2581l ' ouest \\ n \u2581\\ nt \u00e9l\u00e9 com s \\ n \u2581\\ n innovation \\ n \u2581\\ n internet \\ n \u2581\\ n entretien s \\ n \u2581\\ n fran\u00e7ais \\ n more \\ na f rique \u2581centrale \\ n \u2581\\ na f rique \u2581de \u2581l ' ouest \\ n \u2581\\ nt \u00e9l\u00e9 com s \\ n \u2581\\ n innovation \\ n"
    ],
    "target": "en",
    "target-segments": [
        "\u258130 .4 \u2581c \\ ny a ound \u00e9 \\ n \u2581\\ n l undi , \u25814 \u2581October \u258120 21 \u258111 :",
        "\u258146 \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n \\ n"
    ]
}

@jorgtied
Member

I guess that the input text is just too different from what the model has seen in training. It is trained with sentences but the input is very much fragmented with short terms and phrases. Does that also happen with full sentences on one line as input?
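
A quick way to test this is to send each fragment as its own request instead of one long string. Note that the `source-segments` in both outputs above show `\` and `n` as separate tokens (for example `\ ny a ound é`), which suggests the line breaks reached the server as the literal two-character sequence `\n` rather than as real newlines. The sketch below handles both cases. It is only an illustration: `translate` is a hypothetical callable standing in for whatever client call is used, and nothing here is part of the OPUS-MT API.

```python
def split_into_lines(raw):
    """Split raw text into non-empty, stripped lines.

    Both real newlines and the literal backslash-n sequence count as breaks.
    """
    normalized = raw.replace("\\n", "\n")
    return [line.strip() for line in normalized.splitlines() if line.strip()]


def translate_per_line(raw, translate):
    """Translate each line separately with the given callable.

    `translate` is a placeholder: any function that takes one source string
    and returns its translation.
    """
    return [translate(line) for line in split_into_lines(raw)]
```

Translating line by line keeps each request short, so a failure on one fragment cannot spill over into the rest of the text.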
