[youtube] still download translated_subs with --extractor-args youtube:skip=translated_subs #4090

Closed
6 of 7 tasks
NewUserHa opened this issue Jun 16, 2022 · 13 comments
Labels
question Question

Comments

@NewUserHa

NewUserHa commented Jun 16, 2022

Checklist

Region

No response

Description

[info] Available automatic captions for 8-nXN9ZMl8o:
Language Name                  Formats
af       Afrikaans             vtt, ttml, srv3, srv2, srv1, json3
sq       Albanian              vtt, ttml, srv3, srv2, srv1, json3
am       Amharic               vtt, ttml, srv3, srv2, srv1, json3
ar       Arabic                vtt, ttml, srv3, srv2, srv1, json3
hy       Armenian              vtt, ttml, srv3, srv2, srv1, json3
az       Azerbaijani           vtt, ttml, srv3, srv2, srv1, json3
bn       Bangla                vtt, ttml, srv3, srv2, srv1, json3
eu       Basque                vtt, ttml, srv3, srv2, srv1, json3
be       Belarusian            vtt, ttml, srv3, srv2, srv1, json3
bs       Bosnian               vtt, ttml, srv3, srv2, srv1, json3
bg       Bulgarian             vtt, ttml, srv3, srv2, srv1, json3
my       Burmese               vtt, ttml, srv3, srv2, srv1, json3
ca       Catalan               vtt, ttml, srv3, srv2, srv1, json3
ceb      Cebuano               vtt, ttml, srv3, srv2, srv1, json3
zh-Hans  Chinese (Simplified)  vtt, ttml, srv3, srv2, srv1, json3
zh-Hant  Chinese (Traditional) vtt, ttml, srv3, srv2, srv1, json3
co       Corsican              vtt, ttml, srv3, srv2, srv1, json3
hr       Croatian              vtt, ttml, srv3, srv2, srv1, json3
cs       Czech                 vtt, ttml, srv3, srv2, srv1, json3
da       Danish                vtt, ttml, srv3, srv2, srv1, json3
nl       Dutch                 vtt, ttml, srv3, srv2, srv1, json3
en-orig  English (Original)    vtt, ttml, srv3, srv2, srv1, json3
en       English               vtt, ttml, srv3, srv2, srv1, json3
eo       Esperanto             vtt, ttml, srv3, srv2, srv1, json3
et       Estonian              vtt, ttml, srv3, srv2, srv1, json3
fil      Filipino              vtt, ttml, srv3, srv2, srv1, json3
fi       Finnish               vtt, ttml, srv3, srv2, srv1, json3
fr       French                vtt, ttml, srv3, srv2, srv1, json3
gl       Galician              vtt, ttml, srv3, srv2, srv1, json3
ka       Georgian              vtt, ttml, srv3, srv2, srv1, json3
de       German                vtt, ttml, srv3, srv2, srv1, json3
el       Greek                 vtt, ttml, srv3, srv2, srv1, json3
gu       Gujarati              vtt, ttml, srv3, srv2, srv1, json3
ht       Haitian Creole        vtt, ttml, srv3, srv2, srv1, json3
ha       Hausa                 vtt, ttml, srv3, srv2, srv1, json3
haw      Hawaiian              vtt, ttml, srv3, srv2, srv1, json3
iw       Hebrew                vtt, ttml, srv3, srv2, srv1, json3
hi       Hindi                 vtt, ttml, srv3, srv2, srv1, json3
hmn      Hmong                 vtt, ttml, srv3, srv2, srv1, json3
hu       Hungarian             vtt, ttml, srv3, srv2, srv1, json3
is       Icelandic             vtt, ttml, srv3, srv2, srv1, json3
ig       Igbo                  vtt, ttml, srv3, srv2, srv1, json3
id       Indonesian            vtt, ttml, srv3, srv2, srv1, json3
ga       Irish                 vtt, ttml, srv3, srv2, srv1, json3
it       Italian               vtt, ttml, srv3, srv2, srv1, json3
ja       Japanese              vtt, ttml, srv3, srv2, srv1, json3
jv       Javanese              vtt, ttml, srv3, srv2, srv1, json3
kn       Kannada               vtt, ttml, srv3, srv2, srv1, json3
kk       Kazakh                vtt, ttml, srv3, srv2, srv1, json3
km       Khmer                 vtt, ttml, srv3, srv2, srv1, json3
rw       Kinyarwanda           vtt, ttml, srv3, srv2, srv1, json3
ko       Korean                vtt, ttml, srv3, srv2, srv1, json3
ku       Kurdish               vtt, ttml, srv3, srv2, srv1, json3
ky       Kyrgyz                vtt, ttml, srv3, srv2, srv1, json3
lo       Lao                   vtt, ttml, srv3, srv2, srv1, json3
la       Latin                 vtt, ttml, srv3, srv2, srv1, json3
lv       Latvian               vtt, ttml, srv3, srv2, srv1, json3
lt       Lithuanian            vtt, ttml, srv3, srv2, srv1, json3
lb       Luxembourgish         vtt, ttml, srv3, srv2, srv1, json3
mk       Macedonian            vtt, ttml, srv3, srv2, srv1, json3
mg       Malagasy              vtt, ttml, srv3, srv2, srv1, json3
ms       Malay                 vtt, ttml, srv3, srv2, srv1, json3
ml       Malayalam             vtt, ttml, srv3, srv2, srv1, json3
mt       Maltese               vtt, ttml, srv3, srv2, srv1, json3
mi       Māori                 vtt, ttml, srv3, srv2, srv1, json3
mr       Marathi               vtt, ttml, srv3, srv2, srv1, json3
mn       Mongolian             vtt, ttml, srv3, srv2, srv1, json3
ne       Nepali                vtt, ttml, srv3, srv2, srv1, json3
no       Norwegian             vtt, ttml, srv3, srv2, srv1, json3
ny       Nyanja                vtt, ttml, srv3, srv2, srv1, json3
or       Odia                  vtt, ttml, srv3, srv2, srv1, json3
ps       Pashto                vtt, ttml, srv3, srv2, srv1, json3
fa       Persian               vtt, ttml, srv3, srv2, srv1, json3
pl       Polish                vtt, ttml, srv3, srv2, srv1, json3
pt       Portuguese            vtt, ttml, srv3, srv2, srv1, json3
pa       Punjabi               vtt, ttml, srv3, srv2, srv1, json3
ro       Romanian              vtt, ttml, srv3, srv2, srv1, json3
ru       Russian               vtt, ttml, srv3, srv2, srv1, json3
sm       Samoan                vtt, ttml, srv3, srv2, srv1, json3
gd       Scottish Gaelic       vtt, ttml, srv3, srv2, srv1, json3
sr       Serbian               vtt, ttml, srv3, srv2, srv1, json3
sn       Shona                 vtt, ttml, srv3, srv2, srv1, json3
sd       Sindhi                vtt, ttml, srv3, srv2, srv1, json3
si       Sinhala               vtt, ttml, srv3, srv2, srv1, json3
sk       Slovak                vtt, ttml, srv3, srv2, srv1, json3
sl       Slovenian             vtt, ttml, srv3, srv2, srv1, json3
so       Somali                vtt, ttml, srv3, srv2, srv1, json3
st       Southern Sotho        vtt, ttml, srv3, srv2, srv1, json3
es       Spanish               vtt, ttml, srv3, srv2, srv1, json3
su       Sundanese             vtt, ttml, srv3, srv2, srv1, json3
sw       Swahili               vtt, ttml, srv3, srv2, srv1, json3
sv       Swedish               vtt, ttml, srv3, srv2, srv1, json3
tg       Tajik                 vtt, ttml, srv3, srv2, srv1, json3
ta       Tamil                 vtt, ttml, srv3, srv2, srv1, json3
tt       Tatar                 vtt, ttml, srv3, srv2, srv1, json3
te       Telugu                vtt, ttml, srv3, srv2, srv1, json3
th       Thai                  vtt, ttml, srv3, srv2, srv1, json3
tr       Turkish               vtt, ttml, srv3, srv2, srv1, json3
tk       Turkmen               vtt, ttml, srv3, srv2, srv1, json3
uk       Ukrainian             vtt, ttml, srv3, srv2, srv1, json3
ur       Urdu                  vtt, ttml, srv3, srv2, srv1, json3
ug       Uyghur                vtt, ttml, srv3, srv2, srv1, json3
uz       Uzbek                 vtt, ttml, srv3, srv2, srv1, json3
vi       Vietnamese            vtt, ttml, srv3, srv2, srv1, json3
cy       Welsh                 vtt, ttml, srv3, srv2, srv1, json3
fy       Western Frisian       vtt, ttml, srv3, srv2, srv1, json3
xh       Xhosa                 vtt, ttml, srv3, srv2, srv1, json3
yi       Yiddish               vtt, ttml, srv3, srv2, srv1, json3
yo       Yoruba                vtt, ttml, srv3, srv2, srv1, json3
zu       Zulu                  vtt, ttml, srv3, srv2, srv1, json3
[info] Available subtitles for 8-nXN9ZMl8o:
Language Name    Formats
en       English vtt, ttml, srv3, srv2, srv1, json3

Other links have no such issue with --extractor-args youtube:skip=translated_subs.

Verbose log

[debug] Command-line config: ['-vU', '--write-sub', '--write-auto-sub', '--convert-subs', 'srt', '--sub-lang', 'en.*,en-US,ja', '--extractor-args', 'youtube:skip=translated_subs', 'https://youtu.be/8-nXN9ZMl8o', '--skip-download']
[debug] Encodings: locale cp936, fs utf-8, pref cp936, out utf-8, error utf-8, screen utf-8
[debug] yt-dlp version 2022.05.18 [b14d52355]
[debug] Python version 3.10.5 (CPython 64bit) - Windows-10-10.0.17134-SP0
[debug] Checking exe version: ffprobe -bsfs
[debug] Checking exe version: ffmpeg -bsfs
[debug] exe versions: ffmpeg 5.0-full_build-www.gyan.dev (setts), ffprobe 5.0-full_build-www.gyan.dev, phantomjs 2.1.1
[debug] Optional libraries: Cryptodome-3.14.1, brotli-1.0.9, certifi-2022.05.18.1, mutagen-1.45.1, sqlite3-2.6.0, websockets-10.3
[debug] Proxy map: {}
Latest version: 2022.05.18, Current version: 2022.05.18
yt-dlp is up to date (2022.05.18)
[debug] [youtube] Extracting URL: https://youtu.be/8-nXN9ZMl8o
[youtube] 8-nXN9ZMl8o: Downloading webpage
[youtube] 8-nXN9ZMl8o: Downloading android player API JSON
[debug] Sort order given by extractor: quality, res, fps, hdr:12, source, codec:vp9.2, lang, proto
[debug] Formats sorted by: hasvid, ie_pref, quality, res, fps, hdr:12(7), source, vcodec:vp9.2(10), acodec, lang, proto, filesize, fs_approx, tbr, vbr, abr, asr, vext, aext, hasaud, id
[debug] Downloading subtitles: en, en-orig, ja
[debug] Default format spec: bestvideo*+bestaudio/best
[info] 8-nXN9ZMl8o: Downloading 1 format(s): 313+251
[info] Writing video subtitles to: Overwatch Animated Short _ “The Wastelander” [4K] [8-nXN9ZMl8o].en.vtt
[debug] Invoking http downloader on "https://www.youtube.com/api/timedtext?v=8-nXN9ZMl8o&caps=asr&xoaf=4&xosf=1&hl=en&ip=0.0.0.0&ipbits=0&expire=1655430436&sparams=ip%2Cipbits%2Cexpire%2Cv%2Ccaps%2Cxoaf&signature=87AEC7EF4D6D831C054E683C9EA46B408263B8AD.5AC6D3A47C43254FD431685FB3DC450E6B336BA1&key=yt8&lang=en&fmt=vtt"
[debug] File locking is not supported on this platform. Proceeding without locking
[download] Destination: Overwatch Animated Short _ “The Wastelander” [4K] [8-nXN9ZMl8o].en.vtt
[download] 100% of 10.34KiB in 00:00
[info] Writing video subtitles to: Overwatch Animated Short _ “The Wastelander” [4K] [8-nXN9ZMl8o].en-orig.vtt
[debug] Invoking http downloader on "https://www.youtube.com/api/timedtext?v=8-nXN9ZMl8o&caps=asr&xoaf=4&xosf=1&hl=en&ip=0.0.0.0&ipbits=0&expire=1655430436&sparams=ip%2Cipbits%2Cexpire%2Cv%2Ccaps%2Cxoaf&signature=87AEC7EF4D6D831C054E683C9EA46B408263B8AD.5AC6D3A47C43254FD431685FB3DC450E6B336BA1&key=yt8&kind=asr&lang=en&fmt=vtt"
[download] Destination: Overwatch Animated Short _ “The Wastelander” [4K] [8-nXN9ZMl8o].en-orig.vtt
[download] 100% of 16.73KiB in 00:00
[info] Writing video subtitles to: Overwatch Animated Short _ “The Wastelander” [4K] [8-nXN9ZMl8o].ja.vtt
[debug] Invoking http downloader on "https://www.youtube.com/api/timedtext?v=8-nXN9ZMl8o&caps=asr&xoaf=4&xosf=1&hl=en&ip=0.0.0.0&ipbits=0&expire=1655430436&sparams=ip%2Cipbits%2Cexpire%2Cv%2Ccaps%2Cxoaf&signature=87AEC7EF4D6D831C054E683C9EA46B408263B8AD.5AC6D3A47C43254FD431685FB3DC450E6B336BA1&key=yt8&kind=asr&lang=en&tlang=ja&fmt=vtt"
[download] Destination: Overwatch Animated Short _ “The Wastelander” [4K] [8-nXN9ZMl8o].ja.vtt
[download] 100% of 17.42KiB in 00:00
[SubtitlesConvertor] Converting subtitles
[debug] ffmpeg command line: ffmpeg -y -loglevel "repeat+info" -i "file:Overwatch Animated Short _ “The Wastelander” [4K] [8-nXN9ZMl8o].en.vtt" -f srt -movflags "+faststart" "file:Overwatch Animated Short _ “The Wastelander” [4K] [8-nXN9ZMl8o].en.srt"
[debug] ffmpeg command line: ffmpeg -y -loglevel "repeat+info" -i "file:Overwatch Animated Short _ “The Wastelander” [4K] [8-nXN9ZMl8o].en-orig.vtt" -f srt -movflags "+faststart" "file:Overwatch Animated Short _ “The Wastelander” [4K] [8-nXN9ZMl8o].en-orig.srt"
[debug] ffmpeg command line: ffmpeg -y -loglevel "repeat+info" -i "file:Overwatch Animated Short _ “The Wastelander” [4K] [8-nXN9ZMl8o].ja.vtt" -f srt -movflags "+faststart" "file:Overwatch Animated Short _ “The Wastelander” [4K] [8-nXN9ZMl8o].ja.srt"
Deleting original file Overwatch Animated Short _ “The Wastelander” [4K] [8-nXN9ZMl8o].ja.vtt (pass -k to keep)
Deleting original file Overwatch Animated Short _ “The Wastelander” [4K] [8-nXN9ZMl8o].en-orig.vtt (pass -k to keep)
Deleting original file Overwatch Animated Short _ “The Wastelander” [4K] [8-nXN9ZMl8o].en.vtt (pass -k to keep)
NewUserHa added the site-bug and triage labels Jun 16, 2022
@pukkandan
Member

They are auto-generated subs, not auto-translated ones. See the output of --list-subs without the extractor-arg.

pukkandan added the question label and removed the site-bug and triage labels Jun 16, 2022
@pukkandan
Member

Duplicate of #3875

pukkandan marked this as a duplicate of #3875 Jun 16, 2022
@NewUserHa
Author

I checked those subs (the ja one), but it is auto-translated too.
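
The three download URLs in the verbose log differ only in a few query parameters, which is one way to tell the tracks apart. Below is a minimal sketch that labels a URL by those parameters. It assumes that kind=asr marks a speech-recognition track and that tlang names a translation target; that is a reading of the parameter names, not documented behavior. The function name and the labels are made up for this illustration.

```python
from urllib.parse import urlparse, parse_qs


def classify_timedtext_url(url: str) -> str:
    """Label a timedtext URL by its query parameters (labels are this sketch's own)."""
    params = parse_qs(urlparse(url).query)
    # Assumption: kind=asr means the track comes from automatic speech recognition.
    is_asr = params.get("kind") == ["asr"]
    # Assumption: tlang, when present, is the language the track is translated into.
    target = params.get("tlang", [None])[0]
    source = params.get("lang", ["?"])[0]
    if target:
        origin = "auto-generated" if is_asr else "uploaded"
        return f"translated to {target} from {origin} {source}"
    return f"auto-generated {source}" if is_asr else f"uploaded {source}"


# Query strings shortened from the three URLs in the verbose log.
BASE = "https://www.youtube.com/api/timedtext?v=8-nXN9ZMl8o&caps=asr"
urls = {
    "en": BASE + "&lang=en&fmt=vtt",
    "en-orig": BASE + "&kind=asr&lang=en&fmt=vtt",
    "ja": BASE + "&kind=asr&lang=en&tlang=ja&fmt=vtt",
}

for name, url in urls.items():
    print(name, "->", classify_timedtext_url(url))
```

Under those assumptions, the ja URL from the log is the only one of the three that carries a tlang parameter.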

@pukkandan
Member

YouTube auto-generates subtitles in every language. If the video has normal subtitles, each of them can also be translated to all languages. The extractor-arg is intended to skip only the latter case in order to improve performance. For proper subtitle selection, use --sub-langs.
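
As a rough illustration of selecting by language, here is a sketch of how the --sub-lang value from the verbose log (en.*,en-US,ja) could pick tracks. It assumes each comma-separated entry is a regular expression that must match the whole language code. select_languages is a hypothetical helper, not yt-dlp's code, and the real selection logic may differ in ordering and in special values.

```python
import re


def select_languages(patterns, available):
    """Return the available codes fully matched by any pattern, in listing order."""
    return [
        lang for lang in available
        if any(re.fullmatch(pattern, lang) for pattern in patterns)
    ]


# A few codes taken from the listings in this issue (not the full list).
available = ["de", "en-orig", "en", "fr", "ja", "zu"]
patterns = "en.*,en-US,ja".split(",")

print(select_languages(patterns, available))  # ['en-orig', 'en', 'ja']
```

This is the same set of languages the log reports as "Downloading subtitles: en, en-orig, ja"; en-US matches nothing because no such track is listed.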

@NewUserHa
Author

I checked the web player, and only en and en-auto are listed.
The web player evidently treats the 'auto-generated' subs as auto-translated.

@NewUserHa
Author

Maybe those subs are actually just useless and not meant for real users?

@NewUserHa
Author

Auto-translated subs are unreadable in most cases. The only useful subs are:

  • those uploaded by people
  • auto-generated CC from speech

So there may need to be a way to download only those.
Perhaps an additional option such as --extractor-args youtube:skip=auto-gened_subs is needed (one that still keeps the CC)?

@chrizilla

@pukkandan: YouTube auto-generates subtitles in every language. If the video has normal subtitles, each of them can also be translated to all languages. The extractor-arg is intended to skip only the latter case in order to improve performance. For proper subtitle selection, use --sub-langs.

This is a great explanation. Thank you.

In a YouTube video that is in English only and has one English subtitle track, what is the difference between, for example, the Zulu automatic caption (zu) and the Zulu-from-English automatic caption (zu-en)?
Is zu translated from the English subtitle and zu-en translated from the English automatic caption?

@pukkandan
Member

My understanding is that zu is tts from the audio and zu-en is translated from English subs

@chrizilla

chrizilla commented Apr 15, 2023

My understanding is that zu is tts from the audio

Here is why I would doubt that:

Example: https://youtu.be/SoEkCshMcOY is a short clip of Donald Trump talking in English.

content of en:

<p begin="00:00:00.030" end="00:00:04.440" style="s2">much of it going to farmers and</p>
<p begin="00:00:01.860" end="00:00:06.629" style="s2">manufacturers so I&#39;ll let you know I</p>
<p begin="00:00:04.440" end="00:00:11.519" style="s2">mean I hope they got honor the deal</p>
<p begin="00:00:06.629" end="00:00:12.990" style="s2">what are you working for China I work</p>
<p begin="00:00:11.519" end="00:00:16.020" style="s2">with China are you with in this paper</p>
<p begin="00:00:12.990" end="00:00:19.439" style="s2">who are you with Bennett TV who owns</p>
<p begin="00:00:16.020" end="00:00:23.400" style="s2">that China aims it all by China or is it</p>
<p begin="00:00:19.439" end="00:00:26.460" style="s2">owned by the state no it&#39;s not ok good</p>
<p begin="00:00:23.400" end="00:00:28.080" style="s2">ok look I&#39;ll let you know I&#39;ll give you</p>
<p begin="00:00:26.460" end="00:00:30.000" style="s2">a good answer to that in a few months I</p>
<p begin="00:00:28.080" end="00:00:32.130" style="s2">wanted to see what they do because it&#39;s</p>
<p begin="00:00:30.000" end="00:00:34.920" style="s2">time for them to help us ok it&#39;s time</p>
<p begin="00:00:32.130" end="00:00:37.380" style="s2">right now for China to help us and</p>
<p begin="00:00:34.920" end="00:00:39.980" style="s2">hopefully they do and if they don&#39;t</p>
<p begin="00:00:37.380" end="00:00:39.980" style="s2">that&#39;s okay too</p>

content of zu :

<p begin="00:00:00.030" end="00:00:04.440" style="s2">okuningi kuya kubalimi nabakhiqizi</p>
<p begin="00:00:01.860" end="00:00:06.629" style="s2">ngakho ngizonazisa ngiqonde ukuthi ngithemba ukuthi</p>
<p begin="00:00:04.440" end="00:00:11.519" style="s2">bathole ukuhlonishwa isivumelwano</p>
<p begin="00:00:06.629" end="00:00:12.990" style="s2">usebenzela iShayina ngisebenza</p>
<p begin="00:00:11.519" end="00:00:16.020" style="s2">neChina ukhona nobani kuleli phepha</p>
<p begin="00:00:12.990" end="00:00:19.439" style="s2">wena noBennett TV ongumnikazi walelo</p>
<p begin="00:00:16.020" end="00:00:23.400" style="s2">China  ihlose iShayina yonke noma</p>
<p begin="00:00:19.439" end="00:00:26.460" style="s2">iphethwe umbuso cha akulungile kulungile bheka ngizonazisa</p>
<p begin="00:00:23.400" end="00:00:28.080" style="s2">ngizoninika</p>
<p begin="00:00:26.460" end="00:00:30.000" style="s2">impendulo eqondile kulokho ezinyangeni ezimbalwa</p>
<p begin="00:00:28.080" end="00:00:32.130" style="s2">bengifuna ukubona ukuthi benzani ngoba</p>
<p begin="00:00:30.000" end="00:00:34.920" style="s2">sekuyisikhathi sokuthi  ukuthi basisize kulungile sekuyisikhathi</p>
<p begin="00:00:32.130" end="00:00:37.380" style="s2">manje sokuthi iChina isisize futhi</p>
<p begin="00:00:34.920" end="00:00:39.980" style="s2">ngethemba ukuthi izosisiza futhi uma ingakwenzi</p>
<p begin="00:00:37.380" end="00:00:39.980" style="s2">lokho kulungile futhi</p>

Trump doesn't speak Zulu, so this must be auto-translated rather than text-to-speech.
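
The en and zu excerpts above also share the same cue timings, which is consistent with zu being derived from the en track rather than produced independently. Here is a small sketch that checks this for the first two cues quoted above. It uses a regular expression as a shortcut rather than a full TTML parser, and cue_times is a name made up for this example.

```python
import re

# Matches the begin/end attributes of a TTML <p> element as quoted above.
CUE = re.compile(r'<p begin="([^"]+)" end="([^"]+)"')


def cue_times(ttml: str):
    """Extract (begin, end) pairs from TTML <p> elements."""
    return CUE.findall(ttml)


en = '''<p begin="00:00:00.030" end="00:00:04.440" style="s2">much of it going to farmers and</p>
<p begin="00:00:01.860" end="00:00:06.629" style="s2">manufacturers so I&#39;ll let you know I</p>'''

zu = '''<p begin="00:00:00.030" end="00:00:04.440" style="s2">okuningi kuya kubalimi nabakhiqizi</p>
<p begin="00:00:01.860" end="00:00:06.629" style="s2">ngakho ngizonazisa ngiqonde ukuthi ngithemba ukuthi</p>'''

same = cue_times(en) == cue_times(zu)
print("identical cue timings:", same)
```

Identical timings do not show how the translation is produced internally, only that both tracks use the same segmentation.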

@pukkandan
Member

Obviously it is being translated. But from audio instead of from subs. Internally, they may be generating one auto sub first and then translating to other languages. Or they may have some system to do it directly from speech. I have no way of knowing.

@chrizilla

Is zu translated from the English subtitle and zu-en translated from the English automatic caption?

My understanding is that zu is tts from the audio and zu-en is translated from English subs

So if I get that right, you think it's the other way round:

  • zu is translated from the English automatic caption
  • zu-en is translated from the English text subtitle

@pukkandan
Member

Yes. Due to the way subs are extracted, I know for a fact that zu-en is translated from the English subtitle. The other one is my educated guess. I'm pretty sure it's correct, but I don't really have any hard evidence either.
