Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Escape reserved path characters in the remote images post-transform #12253

Conversation

jayaddison
Copy link
Contributor

Feature or Bugfix

  • Bugfix

Purpose

  • Allow remote image URLs that contain port number specifications to be saved correctly during the post-transform steps on Windows.

Detail

  • Add the colon character (:) to the dictionary of character-to-path mappings in the ImageDownloader class; it isn't considered a valid path component character on Windows, meaning that attempts to create directories that include it in the name fail.

Relates

@jayaddison jayaddison marked this pull request as ready for review April 9, 2024 23:22
@jayaddison jayaddison changed the title Remote image post-transform: add a character-translation to support port number segments. Remote image post-transform: add a character-translation to support port number segments on Windows. Apr 10, 2024
@jayaddison
Copy link
Contributor Author

Perhaps this isn't ideal, particularly because it would affect the on-disk image download storage layout for non-Windows hosts. In particular: for hosts that had previously successfully image downloads in a http/localhost:7777/foo/bar directory path, then after this change, they would not discover that existing image content.

@jayaddison jayaddison marked this pull request as draft April 10, 2024 11:19
@jayaddison
Copy link
Contributor Author

Perhaps this isn't ideal, particularly because it would affect the on-disk image download storage layout for non-Windows hosts. In particular: for hosts that had previously successfully image downloads in a http/localhost:7777/foo/bar directory path, then after this change, they would not discover that existing image content.

I've adjusted the logic to retain the colon symbol on POSIX systems, meaning that cache re-use should be possible there. Existing cached content will not have existed on Windows systems for these URLs, because the directory names would have been considered invalid and not created in the first place.

I still think we should consider replacing this logic entirely with requests-cache or similar, as I've mentioned in the issue thread (#12100). However, that feels like a more significant change, and should be a separate pull request - either as an alternative to this approach, or a subsequent follow-up.

@jayaddison jayaddison marked this pull request as ready for review April 10, 2024 11:43
@AA-Turner AA-Turner changed the title Remote image post-transform: add a character-translation to support port number segments on Windows. Escape reserved path characters in the remote images post-transform Apr 12, 2024
@AA-Turner
Copy link
Member

I've added all of the reserved characters from https://learn.microsoft.com/en-us/windows/win32/fileio/naming-a-file. This may be overkill, but better safe than sorry.

A

@AA-Turner AA-Turner merged commit e1d5235 into sphinx-doc:master Apr 12, 2024
23 checks passed
@jayaddison
Copy link
Contributor Author

I've added all of the reserved characters from https://learn.microsoft.com/en-us/windows/win32/fileio/naming-a-file. This may be overkill, but better safe than sorry.

A

Thanks @AA-Turner. One thing I am slightly concerned about is that the : (and existing &) character is technically valid within the path segment of a URI -- so there's some, albeit maybe obscure, potential for websites that contain both foo/bar/image.png and foo:bar/image.png to resolve to the same image download cache filename here.

For long filenames, that's avoided due to the use of a sha1 on the complete URI - but that doesn't cover all cases.

I'd figured that perhaps adding colon-only, and only on Windows (where it hadn't been previously supported) would be a cautious improvement here -- but also that it could make sense to try to solve the problem more comprehensively in the not-too-distant future by using a well-tested library to handle all of this for us.

(you may already have understood / inferred all that, but I'd like to make sure)

@AA-Turner
Copy link
Member

We could simply always use the sha1, or a hybrid whereby we use the sanitised URI Authority and path + the sha of the full path?

A

@jayaddison
Copy link
Contributor Author

We could simply always use the sha1, or a hybrid whereby we use the sanitised URI Authority and path + the sha of the full path?

Ok, yep - it looks like I misunderstood some of the code; the sha1 for dirname (the post-translation node URI) occurs whenever the resulting string length is greater than 32. That's going to be the vast majority of cases, I think - especially because despite the variable name dirname, the value does include the filename part of the path (foo.png) (right?).

In that case, I think we should indeed always use sha1 on dirname -- removing the length conditional check -- and we should use the existing translations from v7.2.6 and earlier (? and &) so that cache re-use is maximized.

Does that seem safe and reasonable? (I'm not completely confident myself yet, this is relatively new code to me)

jayaddison added a commit to jayaddison/sphinx that referenced this pull request Apr 13, 2024
@jayaddison jayaddison deleted the issue-12100/image-url-to-directory-safechars branch April 16, 2024 23:37
@github-actions github-actions bot locked as resolved and limited conversation to collaborators May 19, 2024
@AA-Turner AA-Turner added this to the 7.3.0 milestone Jul 13, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remote images: download of URIs containing port number details on Windows fails.
2 participants