Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

text: Fix for garbage displayed at end of ssa subtitles #8265

Closed

Conversation

bennettpeter
Copy link
Contributor

This occurs on some mkv files. The subtitles show perfectly in VLC player
and parole. These are files encoded from TS to mkv using HandbrakeCLI
v1.3.3.

The ssa text is terminated by a null (\x00). In these mkv files the null
is followed by zero to five bytes of random data. The random data is
showing on the screen as junk characters at the end of the subtitle
text. The solution is to change SsaDecoder to use readNullTerminatedString
instead of readLine.

Sample video:
https://www.filezzz.com/d/c6f1229633284d1388d21b18ca568230/SSA_subtitle_garbage_at_end.mkv/preview

Subtitle examples from above video as seen without this fix:
(Swallows pill) Yeah. 0vd
     You read it yet?

  • Yeah, I started it. You? DTG1
  • It's on my list of many
          things to do. �O_�

This occurs on some mkv files. The subtitles show perfectly in VLC player
and parole. These are files encoded from TS to mkv using HandbrakeCLI
v1.3.3.

The ssa text is terminated by a null (\x00). In these mkv files the null
is followed by zero to five bytes of random data. The random data is
showing on the screen as junk characters at the end of the subtitle
text. The solution is to change SsaDecoder to use readNullTerminatedString
instead of readLine.

Sample video:
https://www.filezzz.com/d/c6f1229633284d1388d21b18ca568230/SSA_subtitle_garbage_at_end.mkv/preview

Subtitle examples from above video as seen without this fix:
(Swallows pill) Yeah. 0vd
     You read it yet?
- Yeah, I started it. You? DTG1
- It's on my list of many
      things to do. �O_�
@google-cla google-cla bot added the cla: yes label Nov 22, 2020
@icbaker icbaker self-assigned this Nov 23, 2020
@icbaker
Copy link
Collaborator

icbaker commented Nov 23, 2020

Thanks for the contribution - I'm not sure about this though... SSA is a fundamentally plaintext format (based originally on SubRip), so it seems weird that Handbrake is encoding null-terminated strings instead of line-terminator-terminated.

I appreciate the SSA spec is pretty fluid (/non-existent) - but this does feel more like a handbrake bug...

If we accept this I think we should keep the readLine() call and then truncate to the first NUL character in the string (rather than using readNullTerminatedString() and trying to manually split to lines).

It would also be great to add a sample to SsaDecoderTest. You might need to use a binary editor to craft a file with a NUL character (I quite like bvi).

What do you think?

@bennettpeter
Copy link
Contributor Author

I did check on another mkv files not encoded with this HandbrakeCLI, and I found that the text ends with \r\n\0 and no extra data, so I assumed null terminated was the norm.

The code I provided will still work if there is no null terminator, readNullTerminatedString will stop at the end of the string.

I first thought of your approach with readLine and search for the null in the string but I was not sure if the null character will be found in the string or be discarded by the readLine. Also it occurs to me that this is risky as you are reading random data into a string, and results may be unpredictable since data after the null may or may not be interpreted as a UTF-8 character or perhaps an incomplete UTF-8 character.

Adding a sample to SsaDecoderTest: I will need to understand how the tests work and how to run them. Do you have any guidance on that?

@icbaker
Copy link
Collaborator

icbaker commented Nov 23, 2020

Thinking about this a bit more, it's specifically a problem with SSA-inside-MKV (I think it would very weird if a plaintext, sidecar .ssa file had NUL characters in it).

The 'spec' for this doesn't mention null termination: http:https://matroska.sourceforge.net/technical/specs/subtitles/ssa.html

I wonder if we should change MatroskaExtractor to truncate subtitleSample.getData() to the first NUL byte before calling sampleData here:

setSubtitleEndTime(track.codecId, blockDurationUs, subtitleSample.getData());
// Note: If we ever want to support DRM protected subtitles then we'll need to output the
// appropriate encryption data here.
track.output.sampleData(subtitleSample, subtitleSample.limit());
size += subtitleSample.limit();

I first thought of your approach with readLine and search for the null in the string but I was not sure if the null character will be found in the string or be discarded by the readLine.

Fair point - I think this is avoided by doing the truncation in MatroskaExtractor while we're still dealing in raw bytes.

Also it occurs to me that this is risky as you are reading random data into a string, and results may be unpredictable since data after the null may or may not be interpreted as a UTF-8 character or perhaps an incomplete UTF-8 character.

I think this is also avoided if we truncate in MatroskaExtractor - we truncate before we ever attempt to parse into a string.

Adding a sample to SsaDecoderTest: I will need to understand how the tests work and how to run them. Do you have any guidance on that?

I would hold off on this for now - if we decide to change MatroskaExtractor instead then these tests won't be needed :)

But for future reference:

I'll ask for some other opinions on the team about how we want to approach this - thanks for your patience!

@icbaker
Copy link
Collaborator

icbaker commented Dec 3, 2020

Thanks for your patience on this - I spent some time testing the change I proposed above in MatroskaExtractor. It seems to work, but by the time I'd done that (and crafted some test data) it was easier to just send it for review and submit directly.

So that will be pushed to dev-v2 shortly, and then this PR won't be needed any more and you can probably close it.

@icbaker
Copy link
Collaborator

icbaker commented Dec 3, 2020

It makes sense for ExoPlayer to try and handle this nicely, since the spec is loosely defined and as you say there's already content out there exhibiting this - but there's a risk it has unforeseen consequences and we have to roll it back.

You may want to raise the issue with handbrake as well, and whichever other tools create content like this, since it seems to me the matroska samples should just be correctly sized rather than using null-termination (even if this isn't clear in the spec, it seems common sense). If they push back, perhaps it would be possible to get some clarification from the spec maintainers: https://www.matroska.org/contact.html

ojw28 pushed a commit that referenced this pull request Dec 3, 2020
This was reported for SSA/ASS in PR #8265, but it seems to me the
SubRip part of the Matroska spec is similarly loose, so this change
handles null-terminated strings in both.

#minor-release

PiperOrigin-RevId: 345452667
@bennettpeter
Copy link
Contributor Author

I opened a HandBrake issue at HandBrake/HandBrake#3258 .
I will look out for your commit and test it in my project.
Thank you for your prompt attention to all my pull requests.

@icbaker
Copy link
Collaborator

icbaker commented Dec 3, 2020

This should be fixed by 74bbd53

@bennettpeter
Copy link
Contributor Author

Tested and it looks good. Thanks.

icbaker added a commit that referenced this pull request Jan 11, 2021
This was reported for SSA/ASS in PR #8265, but it seems to me the
SubRip part of the Matroska spec is similarly loose, so this change
handles null-terminated strings in both.

PiperOrigin-RevId: 345452667
@google google locked and limited conversation to collaborators Feb 3, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants