Fixes #244 when `get_path` is used on packed resources #246

ceresek · 2021-10-18T09:24:29Z

Resources get told their path in the constructor argument when created, but PackedResourceLibrary changes that path to a path computed by _get_path_in_cache later in the constructor. With SHA1 override in place, subsequent calls to the get_path method returned a different path than the resource was constructed with, resulting in an error.

This fixes the bug in _load_libraries. An alternative would probably be to apply the SHA1 override in _get_path_in_cache. Let me know if this sounds more logical ?

dougxc · 2021-10-18T23:08:41Z

Hi @ceresek , can you please rebase this PR on master.

ceresek · 2021-10-19T06:02:37Z

Sorry, done.

ceresek · 2021-11-10T09:10:43Z

@dougxc Just pinging :-)

dougxc · 2021-11-10T09:43:57Z

Sorry. I've started the process to merge.

zapster

Thanks @ceresek for pinpointing and fixing the issues. I've added inline comments.

zapster · 2021-11-10T09:52:41Z

mx.py

@@ -2376,7 +2376,7 @@ def _load_libraries(self, libsMap):
 deps = Suite._pop_list(attrs, 'dependencies', context)
 path = attrs.pop('path', None)
 urls = Suite._pop_list(attrs, 'urls', context)
- sha1 = attrs.pop('sha1', None)
+ sha1 = mx_urlrewrites.rewritesha1(urls, attrs.pop('sha1', None))


IMHO it is a bit unintuitive that we rewrite the sha1 here but not the urls and in addition rewrite both of them (again) in the constructors of the *Library instances. Even more worrying is that the _get_path_in_cache would be passed the rewritten sha1 but not the rewritten urls.

AFAICS, the result of the code that calls _get_path_in_cache...

mx/mx.py

Lines 2400 to 2405 in 17db0b9

if path is None and not optional and not (packedResource and "preExtractedPath" in attrs):

if not urls:

abort('Library without "path" attribute must have a non-empty "urls" list attribute or "maven" attribute', context)

if not sha1:

abort('Library without "path" attribute must have a non-empty "sha1" attribute', context)

path = _get_path_in_cache(name, sha1, urls, ext, sources=False)

... is only passed to the constructors:

mx/mx.py

Lines 2424 to 2429 in 17db0b9

if packedResource:

l = PackedResourceLibrary(self, name, path, optional, urls, sha1, **attrs)

elif resource:

l = ResourceLibrary(self, name, path, optional, urls, sha1, **attrs)

else:

l = Library(self, name, path, optional, urls, sha1, sourcePath, sourceUrls, sourceSha1, deps, theLicense, **attrs)

How about we move the code that requires the updated sha1 into the constructors as well? That way we have (1) precise control over which values (rewritten or not) are passed to _get_path_in_cache and (2) all the rewriting is done in one place.

That way we have (1) precise control over which values (rewritten or not) are passed to _get_path_in_cache and (2) all the rewriting is done in one place.

alternatively, we could move all the rewriting into _load_libraries which would have the same affect.

we could move all the rewriting into _load_libraries which would have the same affect

That sounds like the best approach to me as it prevents the un-rewritten urls from leaking out anywhere. Can you please try that change @ceresek .

Thank you for the tips, did the update in Suite::_load_libraries. More specifically:

removed rewrites inside Library (__init__ and get_urls)

removed rewrites inside ResourceLibrary (__init__ and get_urls)

implemented replacement rewrites inside Suite::_load_libraries

There are few remaining bits to maybe consider:

original rewriting in resource classes only took place after substVars which is now not the case. If we need that (my use case does not) then the rewriting probably has to happen later, inside the classes again. (Also, my earlier patch moved substVars so that it happened in the constructor rather than in get_urls methods, not sure if this can matter.)

Library::get_source_path rewrites source URL which I did not touch. I can add rewriting of source URL lists into _load_libraries too, and move it out of the get_source_path method, but my use case does not require that so I do not have a way to test it.

Repository::get_url does optional rewrite which I did not touch. Maybe it is now redundant.

parse_specification and get_source_urls in SuiteImport do rewrites which I did not touch.

Does this work for you ? (In particular, it should be tested on other scenarios that use URL rewrites, and the tests should make sure that rewriting really happens, I'm not sure if there are code paths that construct resource classes without going through _load_libraries.)

original rewriting in resource classes only took place after substVars which is now not the case. If we need that (my use case does not) then the rewriting probably has to happen later, inside the classes again. (Also, my earlier patch moved substVars so that it happened in the constructor rather than in get_urls methods, not sure if this can matter.)

Nice catch! That is a problem. This for example depends on substitutions before rewriting. 😞

Library::get_source_path rewrites source URL which I did not touch. I can add rewriting of source URL lists into _load_libraries too, and move it out of the get_source_path method, but my use case does not require that so I do not have a way to test it.

I'd move it to _load_libraries, but then the substVars issue applies as well.

Repository::get_url does optional rewrite which I did not touch. Maybe it is now redundant.

parse_specification and get_source_urls in SuiteImport do rewrites which I did not touch.

IMHO it is fine to keep those, at least for now.

The substVars is a blocker for this approach IMHO. The existing code is a bit of mess, sorry about that.

My current understanding is that: _get_path_in_cache requires urlrewrites which requires substVars which probably needs the constructor of the super class to be called to initialize the member variables. That would mean we should move the _get_path_in_cache code into the constructors. In addition, I think we should do the substVars and urlrewrites of source_urls also there. I think it does not make sense to store the un-substituted, un-rewritten source urls next to substituted and rewritten download urls.

That is quite a bit of refactoring. 😐

@ceresek if you are actively blocked by the missing feature, I'd also be fine with pushing your original suggestion and clean that up later.
@dougxc thoughts?

Your suggestions sound good to me @zapster and I'd really prefer to take the opportunity to clean this up now rather than leave another "broken window" lying around to bite us later. @ceresek would you be able to try make the suggested changes?

@zapster @dougxc Here is a version that fixes the original issue without much refactoring. In the end, I opted to keep _load_libraries and _get_path_in_cache without change, the rewriting happens entirely inside the library classes.

This change felt safer than the refactoring you mention - I simply was not sure I would get all the corner cases right.

Can you please take a look ?

Modulo the comment I just left about leaking values outside the Library constructor, I think your changes are fine as is. What do you think @zapster ?

dougxc · 2021-12-13T10:44:59Z

mx.py

@@ -8714,19 +8718,25 @@ def __init__(self, suite, name, path, optional, urls, sha1, sourcePath, sourceUr
 BaseLibrary.__init__(self, suite, name, optional, theLicense, **kwArgs)
 ClasspathDependency.__init__(self, **kwArgs)
 self.path = path.replace('/', os.sep) if path is not None else None
- self.urls = [self.substVars(url) for url in urls]
- self.sha1 = mx_urlrewrites.rewritesha1(self.urls, sha1)
+ self.urls = urls


Is there any point after this constructor where we need to access the original urls or the urlsSubstituted? It seems safer if we do not let these intermediate value escape outside the constructor so we should simply store the substituted and rewritten value in self.urls (same for self.sourceSha1).

I'll fix this as part of integrating the PR.

The urls are likely not needed, however, it might be useful to keep them in self.urlsRewritten rather than just self.urls to avoid leftover code using the rewritten values by mistake. The sha1 needs to stay in both original and rewritten form otherwise_get_path_in_cache will fail.

Thanks for looking into this !

The sha1 needs to stay in both original and rewritten form otherwise_get_path_in_cache will fail.

I'm not following. Is there a manual set of steps I can perform that will show me the problem?

To demonstrate, you would need to do an SHA1 rewrite on a PackedResourceLibrary resource. We do that in our infra for example with Renaissance 0.13 AOT runs, with this:

export MX_URLREWRITES='[ {"https://lafo.ssw.uni-linz.ac.at/pub/graal-external-deps/renaissance/renaissance-gpl-0.13.0.jar": {"replacement":"file:./../renaissance.jar","sha1":"5ee38f52755f2544e23a2d1bb9c86c3f0742915b"}}, {"https://lafo.ssw.uni-linz.ac.at/pub/graal-external-deps/renaissance/renaissance-harness_v0.13.0.tar.gz": {"replacement":"file:./../harness.tar.gz","sha1":"7029d2211235185ecd492b6092b95dbbaa5171d5"}} ]'

What happens is, mx will call _get_path_in_cache from _load_libraries to create the path argument for PackedResourceLibrary constructor. That path will include the original SHA1. Inside the constructor, SHA1 is rewritten and the path provided as the argument is used as extract_path, while path is computed again, this time with rewritten SHA1, through another call to _get_path_in_cache. This inconsistency (path and extract_path based on different values of SHA1) can later lead to crash in other mx code, in our case it was mx_substratevm_benchmark.py from Graal calling get_path, which in turn calls SafeDirectoryUpdater, which fails when trying to create a temporary directory because the parent does not exist.

Sorry for dragging this out but I'd really prefer not to do a half fix now and have to revisit this again in future to clean up the inconsistencies you mention. The high order point is that all substitution and rewriting of URLs and SHA1s should be done in one or as few places as possible. I think that's what @zapster refactoring suggestions were aiming at. In terms of testing, if you do the refactoring and it passes a full internal Graal gate (which I will definitely run), then I believe we're ok.
I assume that you must be using a patched mx locally to solve your immediate problem so I don't think there's an urgency to get a half fix in.

Next try :-) here is an update that moves all rewriting inside resource classes, and simplifies _load_libraries, as suggested by @zapster. There are only two places in the code now where resource URL rewriting happens (constructors of Library and ResourceLibrary), the original URL and SHA1 values are no longer kept around.

The code should be tested thoroughly - it works for me with Renaissance and AOT benchmarks, and it appears to work the same as current mx HEAD when building Graal CE, but there are features in the code that are not used by Graal CE.

WRT urgency, we have to use the most current mx with our AOT measurements, without some variant of this fix we do not run some AOT measurements (those that require alternative Renaissance harness to work around Scala library version mismatches). Eventually, these issues may go away due to planned changes to Renaissance, but if we could get some variant of this fix, that would certainly free our hands to focus on other issues.

…e URL versions around

dougxc · 2021-12-18T07:59:52Z

Merged as 7cbb641. Thanks for all the iterations on this @ceresek !

graalvmbot added the oca-signed label Oct 18, 2021

ceresek force-pushed the feature/sha1rewrite branch from 5ede8d9 to 17db0b9 Compare October 19, 2021 06:01

zapster reviewed Nov 10, 2021

View reviewed changes

zapster requested review from dougxc and gilles-duboscq November 10, 2021 10:10

ceresek force-pushed the feature/sha1rewrite branch from 17db0b9 to 24cc6d0 Compare November 24, 2021 15:29

ceresek force-pushed the feature/sha1rewrite branch 2 times, most recently from 797e80d to 59000da Compare December 9, 2021 07:17

dougxc reviewed Dec 13, 2021

View reviewed changes

Petr Tuma added 2 commits December 16, 2021 18:03

A fix to graalvm#244 that works with packed resources

5bb6a92

More extensive refactoring to fix graalvm#244 without keeping multipl…

cbcad84

…e URL versions around

ceresek force-pushed the feature/sha1rewrite branch from ab2db3d to cbcad84 Compare December 16, 2021 17:03

dougxc closed this Dec 18, 2021

graalvmbot merged commit cbcad84 into graalvm:master Dec 18, 2021

ceresek deleted the feature/sha1rewrite branch May 20, 2023 13:07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fixes #244 when `get_path` is used on packed resources #246

Fixes #244 when `get_path` is used on packed resources #246

ceresek commented Oct 18, 2021

dougxc commented Oct 18, 2021

ceresek commented Oct 19, 2021

ceresek commented Nov 10, 2021

dougxc commented Nov 10, 2021

zapster left a comment

zapster Nov 10, 2021

zapster Nov 10, 2021

dougxc Nov 10, 2021

ceresek Nov 24, 2021 •

edited

Loading

zapster Nov 25, 2021

dougxc Nov 25, 2021

ceresek Dec 8, 2021

dougxc Dec 13, 2021

zapster Dec 13, 2021

dougxc Dec 13, 2021

dougxc Dec 13, 2021

ceresek Dec 13, 2021

dougxc Dec 13, 2021

ceresek Dec 14, 2021

dougxc Dec 14, 2021

ceresek Dec 16, 2021

dougxc commented Dec 18, 2021

	if path is None and not optional and not (packedResource and "preExtractedPath" in attrs):
	if not urls:
	abort('Library without "path" attribute must have a non-empty "urls" list attribute or "maven" attribute', context)
	if not sha1:
	abort('Library without "path" attribute must have a non-empty "sha1" attribute', context)
	path = _get_path_in_cache(name, sha1, urls, ext, sources=False)

	if packedResource:
	l = PackedResourceLibrary(self, name, path, optional, urls, sha1, **attrs)
	elif resource:
	l = ResourceLibrary(self, name, path, optional, urls, sha1, **attrs)
	else:
	l = Library(self, name, path, optional, urls, sha1, sourcePath, sourceUrls, sourceSha1, deps, theLicense, **attrs)

Fixes #244 when get_path is used on packed resources #246

Fixes #244 when get_path is used on packed resources #246

Conversation

ceresek commented Oct 18, 2021

dougxc commented Oct 18, 2021

ceresek commented Oct 19, 2021

ceresek commented Nov 10, 2021

dougxc commented Nov 10, 2021

zapster left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ceresek Nov 24, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dougxc commented Dec 18, 2021

Fixes #244 when `get_path` is used on packed resources #246

Fixes #244 when `get_path` is used on packed resources #246

ceresek Nov 24, 2021 •

edited

Loading