Create mousebender.resolve #105

Open · 10 of 11 tasks

brettcannon opened this issue Aug 21, 2023 · 21 comments
Comments

@brettcannon
Owner

brettcannon commented Aug 21, 2023

Use resolvelib to keep compatibility with other projects like pip and PDM.

Provide flexibility in how the resolution is handled:

  • Separate out how metadata is fetched (so you can swap out either the resolution or download logic; also makes testing easier)
  • Track which projects brought in which dependencies (to be able to render the dependency graph)
  • Newest or oldest version (to know when your lower bounds no longer work)
  • Pass in platform and marker details (so you can resolve for any platform on any machine)
  • Fastest or most compatible wheel (so you can see what the loosest wheel tag triple you need is)
  • Support external constraints (since it seems to be a common need)

Provider protocol

@brettcannon
Owner Author

brettcannon commented Aug 21, 2023

Code links:

@pradyunsg
Contributor

pradyunsg commented Sep 1, 2023

A couple of relevant sections from existing standards:

@brettcannon
Owner Author

The identify() method seems to be about generating a key for a requirement (typing suggests it can be a candidate or requirement, but the docstring only mentions requirements):

Given a requirement, return an identifier for it.

This is used to identify a requirement, e.g. whether two requirements
should have their specifier parts merged.

So the important thing here seems to be having requirements that are for the same thing to line up for merging.

Apparently everyone makes extras part of the key: pip, PDM, and the extras example from resolvelib all do. What's interesting is PDM and pip return a string while the resolvelib example returns a string or a tuple of name and extras.

Since the typing of identify() suggests nothing more than it must be hashable, probably the simplest solution here is to return a Tuple[NormalizedName, frozenset[NormalizedName]] for the distribution name and extras. For any requirement that has no extras, the empty frozenset can be used.

Now we just have to figure out the representation of a requirement in order to generate that tuple. 😅
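Assuming the representation ends up exposing .name and .extras the way packaging.requirements.Requirement does, a sketch could be:

from packaging.utils import canonicalize_name

def identify(self, requirement_or_candidate):
    # Same distribution + same extras -> same key, so overlapping
    # requirements line up for specifier merging.
    name = canonicalize_name(requirement_or_candidate.name)
    extras = frozenset(map(canonicalize_name, requirement_or_candidate.extras))
    return name, extras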

@brettcannon
Owner Author

The get_dependencies() method seems to be about taking a candidate (which is distribution and version) and getting back its dependencies in the form of a requirement:

Get dependencies of a candidate.

This should return a collection of requirements that candidate
specifies as its dependencies.

Depending on how candidate objects look, this could be very easy or it could require going back to the wheel to get the dependencies.

Since the resolver is the one having to figure out priorities, the dependencies probably don't have a priority order, thus returning something like a set should be okay.
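In sketch form (candidate.requires_dist is hypothetical: the wheel's Requires-Dist entries, however the candidate representation ends up storing them):

from packaging.requirements import Requirement

def get_dependencies(self, candidate):
    # Order doesn't matter to the resolver; prioritization happens in
    # get_preference(), so an unordered set is fine.
    return {Requirement(line) for line in candidate.requires_dist}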

@brettcannon
Owner Author

The is_satisfied_by() method seems to just check if a candidate satisfies a requirement.

Whether the given requirement can be satisfied by a candidate.

The candidate is guaranteed to have been generated from the
requirement.

A boolean should be returned to indicate whether candidate is a
viable solution to the requirement.

Now the interesting thing is "the candidate is guaranteed to have been generated from the requirement", and yet we still have to check if the candidate satisfies the requirement. 🤔 The best I can think of is multiple requirements got merged into a single requirement and this is to check if the candidate is still viable.
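If so, a minimal sketch (assuming a packaging-style requirement and a .version attribute on the candidate) just rechecks the specifier:

def is_satisfied_by(self, requirement, candidate):
    # find_matches() generated this candidate for the same identifier, but
    # requirements merged since then may have tightened the specifier.
    return candidate.version in requirement.specifier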

@brettcannon
Owner Author

The get_preference() method is all about creating a sort key for the DSU pattern.

The preference is defined as "I think this requirement should be
resolved first". The lower the return value is, the more preferred
this group of arguments is.

The PyPI wheel example from resolvelib and an open issue about a default implementation suggest just going with the number of candidates as that prioritizes the most restricted distributions first.

Pip's implementation is way more involved. From its docstring:

  • Prefer if any of the known requirements is "direct", e.g. points to an
    explicit URL.
  • If equal, prefer if any requirement is "pinned", i.e. contains
    operator === or ==.
  • If equal, calculate an approximate "depth" and resolve requirements
    closer to the user-specified requirements first. If the depth cannot
    by determined (eg: due to no matching parents), it is considered
    infinite.
  • Order user-specified requirements by the order they are specified.
  • If equal, prefers "non-free" requirements, i.e. contains at least one
    operator, such as >= or <.
  • If equal, order alphabetically for consistency (helps debuggability).

PDM's implementation is also more involved. The return value gives a hint as to what it prioritises:

return (
    not is_python,
    not is_top,
    not is_file_or_url,
    not is_pinned,
    not is_backtrack_cause,
    dep_depth,
    -constraints,
    identifier,
)

It might be best to go with the simple solution as suggested in the open issue and example and increase complexity as necessary.
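That simple version is essentially one line (signature per resolvelib's AbstractProvider.get_preference()):

def get_preference(self, identifier, resolutions, candidates, information, backtrack_causes):
    # Fewer candidates left == more constrained == resolve it first.
    return sum(1 for _ in candidates[identifier])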

@brettcannon
Owner Author

If get_dependencies() is candidate -> requirements, the find_matches() method seems to be requirement -> candidates.

This seems to be where communicating with e.g. PyPI would come in (as the PyPI wheel example suggests).

This does require returning things in a sorted order of preference.
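A sketch of that shape (signature per resolvelib's AbstractProvider.find_matches(); self._available_candidates() is a hypothetical index lookup):

def find_matches(self, identifier, requirements, incompatibilities):
    name, _extras = identifier  # the (name, extras) key from identify()
    merged = list(requirements[identifier])
    excluded = {c.version for c in incompatibilities[identifier]}
    candidates = (
        c
        for c in self._available_candidates(name)  # hypothetical index query
        if c.version not in excluded
        and all(self.is_satisfied_by(r, c) for r in merged)
    )
    # Most-preferred first: newest here; reverse the sort to resolve for
    # the oldest versions instead.
    return sorted(candidates, key=lambda c: c.version, reverse=True)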

@brettcannon
Owner Author

After all of those notes, it seems like the methods do the following (with a wheel-specific view):

  • identify(): Dict key for a requirement that is consistent for the distribution (and extras) regardless of any version constraints
  • get_preference(): Sort requirements based on how important they are to resolve for first
  • find_matches(): Find all the candidates that could satisfy a requirement
  • is_satisfied_by(): Does a candidate still work for a requirement?
  • get_dependencies(): Get the dependencies of a candidate

It seems all the magic in terms of what to customize happens in find_matches(). I think the tricky bit here is finding the right abstraction to let things be customized while also helping to minimize network traffic. Do you do subclassing? Composition? Do you ask for everything and let the common code do the filtering of the results, or do you let the thing you delegate to handle that to avoid needless network requests?

Another question is whether you return all potential wheels or just the best one (e.g., compiled wheel and pure Python wheels, or just the compiled one). If you return all wheels you have the greatest chance at success, but it will also increase resolution complexity. It then also becomes a question of whether you prefer newer versions with only pure Python wheels or older versions with compiled wheels.

It seems like requirements are pretty much what packaging.requirements.Requirement is. Candidates are details about a specific wheel.
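Which suggests a candidate could be little more than a record about a specific wheel; a sketch (field names illustrative, not a settled API):

from dataclasses import dataclass

from packaging.utils import NormalizedName
from packaging.version import Version

@dataclass(frozen=True)
class Candidate:
    # Just enough to identify(), sort, and fetch a specific wheel.
    name: NormalizedName
    version: Version
    extras: frozenset[NormalizedName]
    wheel_url: str  # where the wheel (and thus its metadata) lives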

@pradyunsg
Contributor

requirements that are for the same thing to line up for merging.

Same for the corresponding candidates -- for every candidate generated by a find_matches(requirement), it needs to have identify(candidate) == identify(requirement). Without that, the resolver cannot associate the two.

I don't remember if we added an explicit error check for this in resolvelib, but flagging this since I don't see this nuance mentioned here.
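Stated as a check one could put in tests (a sketch; signatures per resolvelib's AbstractProvider):

def check_find_matches_invariant(provider, identifier, requirements, incompatibilities):
    # Every candidate returned for an identifier must identify() back to it;
    # otherwise the resolver can't associate candidate and requirement.
    for candidate in provider.find_matches(identifier, requirements, incompatibilities):
        assert provider.identify(candidate) == identifier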

PS: We need to improve that docstring for identify. 😅

@brettcannon
Owner Author

Same for the corresponding candidates -- for every candidate generated by a find_matches(requirement), it needs to have identify(candidate) == identify(requirement). Without that, the resolver cannot associate the two.

How does that play into extras? The extras example says:

To model projects with extras, we define a candidate as being a project with a
specific set of dependencies. This introduces a problem, as the resolver could
produce a solution that demands version 1.0 of X[foo] and version 2.0 of
X[bar]. This is impossible, as there is actually only one project X to be
installed. To address this, we inject an additional dependency for every
candidate with an extra - X[foo] version v depends on X version v. By doing
this, we constrain the solution to require a unique version of X.

So does that mean you have the same candidate wheel under multiple identifiers to satisfy any and all extras that are possible for a distribution (and thus cache a lot to avoid the cost of looking up the same thing for a distribution just because you now come across a new set of extras)? Or do you add a dependency on the distribution itself to the distribution+extra, and have a virtual candidate that always satisfies the distribution+extra so the resolver effectively ignores that node of the dependency graph?

@brettcannon
Owner Author

PS: We need to improve that docstring for identify. 😅

sarugaku/resolvelib#138

@pradyunsg
Contributor

pradyunsg commented Sep 10, 2023

So does that mean you have the same candidate wheel under multiple identifiers to satisfy any and all extras that are possible for a distribution (and thus cache a lot to avoid the cost of looking up the same thing for a distribution just because you now come across a new set of extras)?

For pip, yes-ish (-ish because I don't recall the caching details). The extra candidate is based on another candidate, and this is handled by having the extra candidate depend on a requirement that only find_matches with the specific candidate for the underlying distribution file that the metadata was pulled from.


For something like the following...

[project]
name = "home"
version = "1.0.0"

[project.optional-dependencies]
base = ["base-dependency"]
dog = ["dog-food", "dog-bed"]
cat = ["cat-food", "cat-bed"]

And resolving home[base, dog] and home[base, cat] will give resolvelib a graph that looks something like:

graph LR
    subgraph legend
        requirement(["requirement"])
        candidate[["candidate"]]
    end
    subgraph cat-extra["identity = home[base,cat]"]
        home-cat-requirement(["home[base,cat]"])
        home-cat-version-one[["ExtrasCandidate(\n  LinkCandidate(home-1.0.0-*.whl),\n  [base,cat]\n)"]]
        style home-cat-version-one text-align:left
    end

    subgraph dog-extra["identity = home[base,dog]"]
        home-dog-requirement(["home[base,dog]"])
        home-dog-version-one[["ExtrasCandidate(\n  LinkCandidate(home-1.0.0-*.whl),\n  [base,dog]\n)"]]
        style home-dog-version-one text-align:left
    end

    subgraph cat-requirements ["layout grouping\ncat extra dependencies"]
        cat-requirement-food(["cat-food"])
        cat-requirement-bed(["cat-bed"])
    end

    subgraph base-requirements ["layout grouping\nbase extra dependencies"]
        base-requirement(["base-dependency"])
    end
    subgraph home-one ["identity = home"]
        home-requirement-one(["ExplicitRequirement(LinkCandidate(home-1.0.0-*.whl))"])
        home-version-one[["LinkCandidate(home-1.0.0-*.whl)"]]
    end
    subgraph dog-requirements ["layout grouping\ndog extra dependencies"]
        dog-requirement-food(["dog-food"])
        dog-requirement-bed(["dog-bed"])
    end


    home-dog-requirement --> home-dog-version-one;
    home-cat-requirement --> home-cat-version-one;

    home-dog-version-one --> home-requirement-one;
    home-dog-version-one --> base-requirement;
    home-cat-version-one --> home-requirement-one;
    home-cat-version-one --> base-requirement;

    home-requirement-one --> home-version-one;

    home-cat-version-one --> cat-requirement-food;
    home-cat-version-one --> cat-requirement-bed;

    home-dog-version-one --> dog-requirement-food;
    home-dog-version-one --> dog-requirement-bed;

And, yes, I've not expanded the requirements that came from the extras into their corresponding candidates because that isn't relevant to the extra handling question.

(A home 2.0.0 will add new nodes corresponding to those for 1.0.0, with basically the same connections if the dependencies are unchanged. Adding that in made mermaid render the graph in an unreadable format, so I've skipped it.)

Edit: Updated to explicitly reference the pip-side class object names.

@brettcannon
Owner Author

@pradyunsg thanks for this! So if I understand this correctly, there are pseudo-candidates per extras set and distro version, which end up with an implicit requirement on the distribution itself -- i.e. no extras -- with a == to the same version the pseudo-candidate specifies.

@pradyunsg
Contributor

So if I understand this correctly, there are pseudo-candidates per extras set and distro version, which end up with an implicit requirement on the distribution itself -- i.e. no extras -- with a == to the same version the pseudo-candidate specifies.

Yes, except that it's not done via == but as an explicit requirement that matches the candidate. This matters for things like:

>>> SpecifierSet("==1.0").contains("1.0+local_version_label") 
True

Where the 1.0 wheel can be different from the 1.0+local_version_label wheel.
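In provider terms that means matching on identity rather than on a specifier; a sketch (ExplicitRequirement is pip's class name, the rest is hypothetical):

class ExplicitRequirement:
    """A requirement satisfiable only by one specific candidate."""
    def __init__(self, candidate):
        self.candidate = candidate

def is_satisfied_by(self, requirement, candidate):
    if isinstance(requirement, ExplicitRequirement):
        # Identity, not "==": a ==1.0 specifier would also accept
        # 1.0+local_version_label, which may be a different wheel.
        return candidate is requirement.candidate
    return candidate.version in requirement.specifier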


I'm realising that this can (and does) lead to the resolver backtracking on the wrong requirement sometimes, since backtracking when an ExplicitRequirement is involved is going to result in suboptimal backtracking. The fix there is to prefer to backtrack on ExplicitRequirement whenever possible.

@brettcannon
Owner Author

backtracking when an ExplicitRequirement is involved is going to result in suboptimal backtracking. The fix there is to prefer to backtrack on ExplicitRequirement whenever possible.

That sounds contradictory. If backtracking when an explicit requirement is involved is suboptimal, why is the fix to prefer backtracking in that case? Are you saying explicit requirements should come earlier or later in the ordering provided by get_preference() (which is what I assume you're suggesting as the mechanism to signal what to backtrack on)?

@brettcannon
Owner Author

And now I'm seeing why sarugaku/resolvelib#14 exists. 😅

@pradyunsg
Contributor

Are you saying explicit requirements should come earlier or later in the ordering provided by get_preference() (which is what I assume you're suggesting as the mechanism to signal what to backtrack on)?

Yea -- the preference for backtracking on the ExplicitRequirement should be higher than for other kinds of requirements when it is implicated in a conflict.

@brettcannon
Owner Author

Started outlining the code in the 'resolve' branch.

@brettcannon
Owner Author

The resolver now works! I haven't done constraints, but it works well enough that I have a PoC resolver running against PyPI using metadata that can be downloaded directly.

If anyone has feedback on the high-level API I would appreciate it since that's the bit that can't be easily changed later.

@ofek

ofek commented Feb 11, 2024

Do you mind briefly explaining how the PoC differs from the resolver in pip?

@brettcannon
Owner Author

Do you mind briefly explaining how the PoC differs from the resolver in pip?

At minimum, the blatant differences are:

  • No support for sdists
  • No support for constraints

There's probably other subtleties that I'm not aware of.
