Support PNG subtitles in SMPTE-TT (Issue: #1583) #5066
Conversation
@ojw28 You can see my changes on branch feature/1583_support_png_ttml; I also added a test stream for now. The test stream has 3 different subtitle tracks. The first one uses exact pixels inside its regions, which is not supported (as far as I have seen), so right now it's not shown, because no region will be parsed and the bitmap width is set by the anonymous region, which is 0 (need to add a default width in this case). However, you should be able to see the second subtitle. It has three cues, as you can see in the XML. The thing is that I tried also to filter these cues inside I am trying to understand how this really works, but any help would be appreciated. Also, if you have any other observations, please let me know. For example, I am not entirely sure if Thanks in advance!
I think you're on the right track.
Look for "div" tags when generating the set of times.
Thank you for the tips! Quick questions:
So to sum up, the result for unsupported regions looks like this right now:
As far as I see, yes. It is already working like this.
Should I add the 0.5f default width here? Which is currently handled in
Regarding the unsupported region: if the decoder doesn't know where a cue should be displayed, it IMO shouldn't fill in any defaults. So I think the correct thing to do is for the decoder to leave everything unset (i.e. not fill in 0.5 for the width), which is what happens for text, I think. It should then be up to
Yeah, you are right :) I'll do these changes on the weekend!
@ojw28
@@ -68,6 +68,8 @@
private static final String ATTR_END = "end";
private static final String ATTR_STYLE = "style";
private static final String ATTR_REGION = "region";
private static final String ATTR_IMAGE = "backgroundImage";
nit - please remove extra blank line
@@ -127,7 +130,7 @@ protected TtmlSubtitle decode(byte[] bytes, int length, boolean reset)
Log.i(TAG, "Ignoring unsupported tag: " + xmlParser.getName());
unsupportedNodeDepth++;
} else if (TtmlNode.TAG_HEAD.equals(name)) {
parseHeader(xmlParser, globalStyles, regionMap, cellResolution, imageMap);
nit - It might be nice for the two "Map" variables to be next to one another (consider this ignorable :)). It would also be nicely consistent with the TtmlSubtitle constructor if this were the case (which takes regionMap and then imageMap as the next argument).
} while (!XmlPullParserUtil.isEndTag(xmlParser, TtmlNode.TAG_HEAD));
return globalStyles;
}

public void parseMetaData(XmlPullParser xmlParser, Map<String, String> imageMap) throws IOException, XmlPullParserException {
nit - We consider metadata to be one word, so should be Metadata rather than MetaData.
This method should be private.
do {
  xmlParser.next();
  if (XmlPullParserUtil.isStartTag(xmlParser, TtmlNode.TAG_SMPTE_IMAGE)) {
    for (int i = 0; i < xmlParser.getAttributeCount(); i++) {
I don't think looking here does anything. You can just execute what's inside the loop once, I think?
oops, totally correct. First I was manually searching for the "id" attribute, then I found out that there is a util method for this. Forgot to remove the loop. Thanks :)
TreeMap<String, SpannableStringBuilder> regionOutputs = new TreeMap<>();
List<Pair<String, String>> regionImageList = new ArrayList<>();
It would be clearer to call these regionTextOutputs and regionImageOutputs or something, to make it clearer that one is the image equivalent of the other.
// Create image based cues
for (Pair<String, String> regionImagePair : regionImageList) {
  String base64 = imageMap.get(regionImagePair.second);
It's probably worth handling the case that imageMap.get returns null (i.e. do nothing in this case, because the image ref didn't point to a valid image).
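A minimal sketch of that guard, assuming the images are stored as Base64 strings keyed by id. The class and method names here are hypothetical for illustration, not ExoPlayer API:

```java
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: resolves an image reference to decoded PNG bytes,
// returning null (so the caller can skip cue creation) when the ref is invalid.
public class ImageRefResolver {

  static byte[] resolveImage(Map<String, String> imageMap, String imageRef) {
    String base64 = imageMap.get(imageRef);
    if (base64 == null) {
      // The image ref didn't point to a valid image; do nothing.
      return null;
    }
    return Base64.getDecoder().decode(base64);
  }

  public static void main(String[] args) {
    Map<String, String> imageMap = new HashMap<>();
    // A fake 4-byte payload standing in for PNG data.
    imageMap.put("img_1",
        Base64.getEncoder().encodeToString(new byte[] {(byte) 0x89, 'P', 'N', 'G'}));

    System.out.println(resolveImage(imageMap, "img_1").length);    // 4
    System.out.println(resolveImage(imageMap, "missing") == null); // true
  }
}
```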
cues.add(
    new Cue(decodedByte,
        region.position,
        Cue.ANCHOR_TYPE_MIDDLE,
Is this definitely correct? As opposed to TYPE_UNSET, which is what's used for the position anchor in the text case below.
Right, the SubtitlePainter should decide how to render this.
Ah, I know why I put this here. SubtitlePainter.setupBitmapLayout() is already doing some logic based on the position anchor, and I didn't want to affect that by applying logic like "if cuePositionAnchor == Cue.TYPE_UNSET then set anchor type middle", because other subtitle decoders might rely on this; that's why I moved this up into the TtmlNode (inspired by DvbParser).
Right now, for an unsupported region I set the position anchor to middle as a default (see your suggestion below); however, this is not the case for supported regions, as cuePosition == Cue.DIMEN_UNSET will be false in SubtitlePainter.setupBitmapLayout().
Without applying this position anchor, the vertical position won't be the same as in the dashif player. What do you think, should I leave this anchor assignment inside TtmlNode? Or we could modify the fallback logic for unset position anchors inside setupBitmapLayout, from this:

int y = Math.round(cuePositionAnchor == Cue.ANCHOR_TYPE_END ? (anchorY - height)
    : cuePositionAnchor == Cue.ANCHOR_TYPE_MIDDLE ? (anchorY - (height / 2)) : anchorY);

to

int y = Math.round(cuePositionAnchor == Cue.ANCHOR_TYPE_END ? (anchorY - height)
    : cuePositionAnchor == Cue.ANCHOR_TYPE_START ? anchorY : (anchorY - (height / 2)));
I still think you should modify setupBitmapLayout to do something sensible if things are unset, and output it as unset from the decoder. There aren't any subtitle decoders setting unset anchors at the moment, so adding logic for that case in setupBitmapLayout should be fine (this is also probably why such logic is missing ;)).
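A sketch of what that unset-anchor fallback could look like, extracted as a pure function. The constants mirror ExoPlayer's Cue but are restated locally here; treat the exact values as illustrative:

```java
// Sketch: vertical placement with a sensible default when the position
// anchor is unset. Constant values mirror ExoPlayer's Cue (illustrative).
public class BitmapAnchorFallback {

  static final int TYPE_UNSET = Integer.MIN_VALUE;
  static final int ANCHOR_TYPE_START = 0;
  static final int ANCHOR_TYPE_MIDDLE = 1;
  static final int ANCHOR_TYPE_END = 2;

  // An unset anchor falls through to the MIDDLE behaviour, so the decoder
  // can leave the anchor unset and the painter still does something sensible.
  static int computeY(int cuePositionAnchor, int anchorY, int height) {
    return Math.round(cuePositionAnchor == ANCHOR_TYPE_END ? (anchorY - height)
        : cuePositionAnchor == ANCHOR_TYPE_START ? anchorY
        : (anchorY - (height / 2f)));
  }

  public static void main(String[] args) {
    System.out.println(computeY(TYPE_UNSET, 500, 100));        // 450 (treated as middle)
    System.out.println(computeY(ANCHOR_TYPE_END, 500, 100));   // 400
    System.out.println(computeY(ANCHOR_TYPE_START, 500, 100)); // 500
  }
}
```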
@@ -355,6 +355,22 @@ private void setupTextLayout() {
}

private void setupBitmapLayout() {
  // Default position
  if (cuePosition == Cue.DIMEN_UNSET) {
Maybe add cuePositionAnchor = ANCHOR_TYPE_MIDDLE to this block?
// Default width
if (cueSize == Cue.DIMEN_UNSET) {
  cueSize = 0.5f;
The way these defaults work can cause a subtitle to be drawn in quite a weird way, I think. This looks like it'll default the cue to occupy 50% of the screen horizontally. Suppose the bitmap subtitle is a very short word, like "Hi". Won't it end up being sized to occupy nearly the entire vertical space in this case :)?
You may need to do something a little more clever, where you define a box within which you want the bitmap to be displayed (e.g. 50% of the space horizontally and 10% vertically), and then figure out how to compute the bitmapRect so that it doesn't go outside of that box.
You are right :)
And let's say "Hi" and "Hi, how are you?" should have the same height; that's not the case now. I mean the cues should try to have the same height as long as they fit, or something like this. I'll try to find a solution for this today.
The other adjustments based on the comments are done, but I haven't pushed those yet.
In the general case I don't think it's possible for you to keep the line height the same if the region hasn't been defined. Doing so would require you to know how many lines are in the bitmap, and you don't know that (so really, having the region defined properly is fairly important). I think fitting within a box that constrains it to a sensibly small portion of the screen is the best you can do.
One idea is to fix the height at 10% as you said and scale the bitmap to fit this height (keeping the aspect ratio); however, if the bitmap goes beyond the box's width (50%), then it will be scaled down by its width until it fits the box (keeping the aspect ratio here as well). I think this is an acceptable solution for "one line" images; however, if the cue has two lines, then the "line height" will be half of the specified box's height (= 5%). I don't see a solution for this particular case, because we don't know all of the bitmaps' sizes ahead of time (e.g. live TV).
Do you think this is a viable solution?
I think what you describe is the best you can do (short of something crazy like using OCR to figure out how many lines there are - which isn't something we're realistically going to do :)).
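The agreed fitting rule can be sketched as pure arithmetic. The 50% / 10% box comes from the discussion above; the class and method names are hypothetical, and this is a sketch of the approach, not the PR's actual implementation:

```java
// Sketch of the agreed box fit: scale the bitmap to a fixed fraction of the
// screen height, keeping aspect ratio; if that makes it wider than the box,
// constrain by width instead (still keeping aspect ratio).
public class BitmapBoxFit {

  // Returns {widthFraction, heightFraction} of the screen the cue occupies.
  static float[] fit(int bitmapW, int bitmapH,
                     int screenW, int screenH,
                     float boxW, float boxH) {
    float aspect = (float) bitmapW / bitmapH;  // bitmap pixel aspect ratio
    float h = boxH;                            // start from the fixed height
    float w = h * screenH * aspect / screenW;  // resulting width fraction
    if (w > boxW) {                            // too wide for the box:
      w = boxW;                                // pin to the box width and
      h = w * screenW / aspect / screenH;      // shrink the height to match
    }
    return new float[] {w, h};
  }

  public static void main(String[] args) {
    // A short "one line" 4:1 bitmap on a 1920x1080 screen keeps the 10% height.
    float[] shortCue = fit(400, 100, 1920, 1080, 0.5f, 0.1f);
    // A very wide 16:1 bitmap hits the 50% width cap and drops below 10% height.
    float[] wideCue = fit(1600, 100, 1920, 1080, 0.5f, 0.1f);
    System.out.println(shortCue[0] + " x " + shortCue[1]);
    System.out.println(wideCue[0] + " x " + wideCue[1]);
  }
}
```

Note the limitation discussed above still applies: without knowing how many lines the bitmap contains, a two-line image ends up with half the intended line height.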
It's probably easier to add support for regions that are defined in pixels than to use OCR, but nice idea, hehe :)
Thanks for the quick response, I will do it the way I described then.
Adjustments are done based on the comments; please have another look at them.
I just tested this with a real live stream, and the subs were extra small; I need to investigate why.
@ojw28
This leads to a problem, because based on the logic that we discussed yesterday, this "full screen" image will be pushed into a smaller box (50% width, 10% height) and the cues will be extremely small, impossible to read. If I manually set the cueSize and cueBitmapHeight to 1.0f for this particular TTML, then it is rendered "correctly", the same way as in other players. Do you think we can add support for regions that are defined in pixels? Or can you think of any shortcut that could solve this edge case? One idea would be to parse the pixels as well, check the region size against common video resolutions (1920x1080, 1280x720, 720x576, etc.), and assume fullscreen on a match. Of course, this would be checked only in the case of image based cues.
In my opinion the proposed solution won't work; the problem is that videoSize doesn't correlate with the region pixels in TTML subs. So in the case of adaptive streams, the video size changes but the sub size (720x546) stays constant. This will result in varying cue sizes. Do you see the other proposed solution as viable? To check the region size against common video sizes and assume full screen in case of a match?
I think that pixel regions defined in TTML captions must be in the context of some window, also in pixels, that's defined one way or another. My proposal above was based on the assumption that the window is defined by the resolution of the original (non-adaptive) video stream to which the TTML captions correspond.
I would rather we not do this. It sounds like a hack to work around the serving side not doing the right thing. Thoughts?
Correct, but it would be better to handle non-adaptive and adaptive streams together.
The TTML should specify a window indeed; currently I am going through the specs / standards to find an official way to solve this.
As it turned out, the window shall be present when the regions are defined in pixels.
This tts:extent is present in the test stream's TTML from dashif. For yet unknown reasons, the live TV streams from our app which I previously used to test this don't send this attribute; we are trying to find out why. However, I would like to propose this solution:
What do you think?
You should be aware that the handling of images in SMPTE-TT is a little different from IMSC 1.0.1 Image Profile. I'm not clear what format the content is in, but my recommendation would be to go for IMSC support if possible. IMSC 1.0.1 Image Profile is the same as IMSC 1.1 Image Profile in this respect.
Thank you for pointing this out; unfortunately, however, I am not aware of, or don't fully understand, the differences between these profiles. The test stream which I used to add support for images in TTML is from dashif: https://livesim.dashif.org/dash/vod/testpic_2s/img_subs_region_px.xml. But in our app, a couple of live TV streams have TTML subs with EBU profiles. I think the main goal (if @ojw28 also agrees) would be to add basic support for images inside the Are these basic "features" the same in the different profiles?
@szaboa the file you linked is not an IMSC 1.x Image Profile document, because it encodes the images in Due to issues with
Need to research what the difference is between
I don't know the details of how SMPTE-TT is used in practice, but I'd have thought that handling EBU doesn't involve image subtitles at all - all the EBU subtitle specs, and the IMSC Text Profile, prohibit the use of images altogether.
@nigelmegitt
But the images are defined with
I would hope that the various profiles aren't actually incompatible with one another. If that's the case then I'm not sure "what we should go with" is the right question to be asking, because it's not a choice of one or the other. It seems reasonable that we would accept contributions covering whatever features and/or profiles contributors wish to add support for.
This sounds good to me.
Do you have any evidence that this is the best thing to do, beyond it being correct for the streams you happen to want to play? I'm not sure that's a particularly good reason (vs dropping the cues because they're not properly specified).
@szaboa can you provide a pointer to the TTML document instance that claims to be an EBU profile and has an image in it? The link from #5066 (comment) does not have this. It is fine to mix vocabularies from different profiles, but doing so may result in a document that does not conform to any single one of those profiles (while still being a valid TTML document). There is not usually any conformance problem created by adding metadata vocabulary like The example in #5066 (comment) is a case in point - if a document has that URN in it, then it is claiming to be an EBU-TT-D document, but if it contains an
So, I went through the IMSC and SMPTE specifications; source: The conclusion is the following: SMPTE: IMSC:
The smpte:backgroundImage attribute shall reference a PNG datastream. Based on a couple of example TTMLs, this attribute has values like "<file_name>.png", so it does not contain a full path.
We do not aim to use these attributes in case of SMPTE at this point, so not a problem.
Right now we are searching for images only in divs, so that is not a problem at the moment. Currently in this pull request, we only support the SMPTE profile, and only if the @ojw28
Unfortunately no evidence, but we can follow up on this after the profiles are clarified. @nigelmegitt And one additional question: in the case of IMSC, what is the base URL of the smpte:backgroundImage? Or from where can these PNGs be resolved? Thank you both for the collaboration!
I don't see any reason to do this unless someone actually needs support for it. Having to maintain features no-one is using isn't an attractive proposition for us :).
@ojw28 I see your point and I agree. So the only thing that I need to finish here is to parse
Maybe the decoder should output the cue with dimensions set to UNSET, and the painter should opt to do nothing in this case? It's probably fine to drop it in the decoder instead though, if that's easier.
@ojw28 Also added a new test case where the regions are pixel defined. Test stream (it has sub tracks with both pixel / percentage regions): Ready for code review.
When is this planned to be merged? :)
We'll try and get this in this week.
if (isActive(timeUs)) {
  if (TAG_DIV.equals(tag) && imageId != null) {
    regionImageList.add(new Pair<>(resolvedRegionId, imageId));
Question: Is it OK to return here? Or do we really need to traverse (L252-254) even though we've found an image at this node?
We're ready to merge this except for the one question I just posted on the traversal code. If you could answer either way, that would be great! There's no need to push any update to the PR; we'll add the return during the merge, if your response indicates it makes sense for us to do so. Thanks!
I am pretty sure it is OK to return there; I don't see why we would need to continue to traverse.
When is this PR planned to be merged? I need this patch for my project.
@getashok2003