Fix is_url from splitting the scheme incorrectly when using PEP 440's direct references #6203

Merged 2 commits on Sep 16, 2019

Conversation

retpolanne
Contributor

@retpolanne retpolanne commented Jan 25, 2019

Hello,

This PR fixes #6202 and includes tests for this issue.
When installing a .whl from a remote URL following this example,
pip @ https:///somewhere/pip-1.3.1-py33-none-any.whl

is_url was splitting the scheme incorrectly and it wouldn't recognize the line as a URL. Pip would try (and fail) to reference a local .whl file instead.
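
A minimal sketch of the failure, assuming is_url decides by splitting the line on the first colon and comparing the prefix against a list of known schemes (the check quoted later in this thread). The function name is_url_sketch, the reduced scheme list, and the example.com URL are illustrative assumptions, not pip's actual code.

```python
# Simplified stand-in for the scheme check. pip's real list also includes
# VCS schemes, which are omitted here.
KNOWN_SCHEMES = ['http', 'https', 'file', 'ftp']


def is_url_sketch(name):
    """Return True if the text before the first ':' is a known scheme."""
    if ':' not in name:
        return False
    scheme = name.split(':', 1)[0].lower()
    return scheme in KNOWN_SCHEMES


plain = 'https://example.com/pip-1.3.1-py33-none-any.whl'
direct_ref = 'pip @ ' + plain

print(is_url_sketch(plain))       # True: the prefix is 'https'
print(is_url_sketch(direct_ref))  # False: the prefix is 'pip @ https'
```

With the "name @ " part in front, the text before the first colon is no longer a bare scheme, so the line is not treated as a URL.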

@retpolanne
Contributor Author

Sorry for the newbie fails on the linting :(

@retpolanne
Contributor Author

@uranusjr hi, could you please take a look at this PR?

Member

@uranusjr uranusjr left a comment

I feel this approach is backwards: instead of treating @ in a URL-like string as a special case, the parser function should be able to exclude that case before the is_url check is even done.

I need to think about this more in detail to figure out what the right approach is, but this is probably not it.

@retpolanne
Contributor Author

@uranusjr would it be a better idea to call split_scheme_from_url before calling is_url?

Or maybe use a regex to extract a URL from the line? (That way you wouldn't special-case just the @.)

return scheme in ['http', 'https', 'file', 'ftp'] + vcs.all_schemes


def split_scheme_from_url(url):
Member

I feel like generic parsing functions like this should go in misc.py with the other URL parsing functions. (Incidentally, I also think that path_to_url() and friends shouldn't be in download.py either.)

Contributor Author

Looking at download.py, these URL-related functions should stay together, either in misc.py or in download.py. But misc.py looks really polluted to me. Maybe creating a utils package that contains a file for URL-related functions would be a better idea?

Member

@cjerdonek cjerdonek Feb 14, 2019

Yes, probably. :) But I don’t want to sidetrack this PR further. For functions other than new functions you’re adding here, it would need to be done as a separate PR. I’m also not sure what type of function you’ll wind up needing after your conversation with @uranusjr. (I haven’t thought about it myself.)

Contributor Author

Great, I would really like to tackle this :)
I'll add this function to misc.py then, if I still have to use it.

@uranusjr
Member

I’m thinking maybe the function should be reorganised somehow. Instead of checking for URL-like, then path-like, and finally treating the input as a name, it should check for a name first (using PEP 508’s definition; maybe packaging.requirements would be useful?), and fall back to looking for a URL or path afterwards.

@retpolanne
Contributor Author

@uranusjr do you have any idea on how to do it this way? I could only think about doing it the other way around, by elimination (e.g. if something is not a file or a url, it is probably a name).

@uranusjr
Member

A formal syntax definition is included in the PEP 508 document, and (I believe) implemented by _vendor.packaging.requirements.Requirement. If that’s the case, you can

  1. Split the markers (same).
  2. Try to parse as a name, and catch the exception if that fails.
  3. Try to parse as a URL, and then path (same).

I’m not sure if that would work, but it could be worth a try.

@retpolanne
Contributor Author

@uranusjr thanks for the explanation. I'll try to understand this class more in-depth. I see that there are many characters and regex that are probably used for parsing here, e.g.

URI = Regex(r"[^ ]+")("url")

but I don't understand how pyparsing works. Maybe the bug that I reported can be fixed here instead? (by adding proper validation)

@uranusjr
Member

I don’t believe that matters, since the rule is only used as part of the name@url syntax. I tried it out a bit:

>>> Requirement('foo@https://[email protected]')
<Requirement('foo@ https://[email protected]')>
>>> Requirement('https://[email protected]')
Traceback (most recent call last):
[snipped]
pip._vendor.packaging.requirements.InvalidRequirement: Parse error at "':https://user@'": Expected stringEnd

So I think you can do something like

try:
    Requirement(name)
except InvalidRequirement:
    pass    # Maybe a nameless URL or a path
else:
    return ...   # Create InstallRequirement from name

if is_url(name):
    return ...   # Create InstallRequirement from URL

return ...  # Create InstallRequirement from path
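
The ordering suggested above (try a named requirement first, then a URL, then a path) can be sketched as runnable code. This is only an illustration under stated assumptions: parse_requirement_sketch is a deliberately simplified regex stand-in for packaging's Requirement parser (no version specifiers, extras, or markers), and classify, is_url_sketch, and the example.com URLs are hypothetical names, not part of pip.

```python
import re

# Simplified stand-in for a PEP 508 parser: a name, optionally followed
# by "@ <scheme>:<rest>". Real requirement strings allow much more.
_NAME = r'[A-Za-z0-9](?:[A-Za-z0-9._-]*[A-Za-z0-9])?'
_REQ_RE = re.compile(
    r'^\s*(?P<name>' + _NAME + r')\s*'
    r'(?:@\s*(?P<url>[A-Za-z][A-Za-z0-9+.-]*:\S+))?\s*$'
)


class InvalidRequirementSketch(ValueError):
    """Raised when the text is not a (simplified) named requirement."""


def parse_requirement_sketch(text):
    match = _REQ_RE.match(text)
    if match is None:
        raise InvalidRequirementSketch(text)
    return match.group('name'), match.group('url')


def is_url_sketch(text):
    return text.split(':', 1)[0].lower() in ('http', 'https', 'file', 'ftp')


def classify(text):
    """Return 'requirement', 'url', or 'path' for the given line."""
    try:
        parse_requirement_sketch(text)
    except InvalidRequirementSketch:
        pass  # maybe a nameless URL or a path
    else:
        return 'requirement'
    if is_url_sketch(text):
        return 'url'
    return 'path'


for line in ['pip',
             'pip @ https://example.com/pip-1.3.1-py33-none-any.whl',
             'https://example.com/pip-1.3.1-py33-none-any.whl',
             './pip-1.3.1-py33-none-any.whl']:
    print(line, '->', classify(line))
```

Because the named form is tried first, a line of the shape name@url never reaches the URL or path checks.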

@uranusjr
Member

uranusjr commented Feb 15, 2019

Err I read the parser code in whole, and it’s… a mess 😭

Let’s start over. So the code currently parses like this:

  1. Does this look like a URL?
    a. Yes. Build a Link and go to 3.
    b. No. Go to 2.
  2. Does this look like a path (contains a path separator or starts with .)?
    a. Yes. Build a Link and go to 3.
    b. No. Treat it as a requirement string and go to 4.
  3. Parse the Link to get a package name, and go to 4.
  4. Build a requirement out of the information gathered.

The problem now is that PEP 440 URL reqs should go 1b-2b-4, but currently fall into 1b-2a-3-4. So we need to find a distinctive characteristic between a path and a PEP 440 URL req (the name req variant poses no problems), and fix the condition in 2.

URL_REQ = NAME "@" SCHEME ":" URI

# According to RFC 3986.
SCHEME = ALPHA *( ALPHA | DIGIT | "+" | "-" | "." )

# According to PEP 508.
NAME = LETTER_OR_DIGIT IDENTIFIER_END*
IDENTIFIER_END = LETTER_OR_DIGIT | (('-' | '_' | '.' )* LETTER_OR_DIGIT)

# I can't find the definition, but according to common sense?
LETTER_OR_DIGIT = ALPHA | DIGIT

We can conclude: A URL req must contain at least one @, and the part before the first @ must not contain a path separator and must not start with a dot (.).

Now the fix becomes clear. The condition near line 235 should be modified to something like this:

def _looks_like_path(name):
    return (
        os.path.sep in name or
        (os.path.altsep is not None and os.path.altsep in name) or
        name.startswith('.')
    )

if is_url(name):
    link = Link(name)
else:
    ...
    elif is_archive_file(p):
        if os.path.isfile(p):
            link = Link(path_to_url(p))
        else:
            url_req_parts = p.split('@', 1)
            if not _looks_like_path(url_req_parts[-1]):
                logger.warning(...)

I know, this change makes the code even messier than before, but this is the best I can come up with without taking the whole thing apart 😞
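
The rule derived in this comment can be checked in isolation. The sketch below assumes a helper shaped like the _looks_like_path proposed above; may_be_url_req and the example.com URL are hypothetical and only illustrate the condition on the text before the first @.

```python
import os


def looks_like_path(name):
    """Return True if name contains a path separator or starts with a dot."""
    if os.path.sep in name:
        return True
    if os.path.altsep is not None and os.path.altsep in name:
        return True
    return name.startswith('.')


def may_be_url_req(name):
    """An '@' must be present, and the text before the first '@' must not
    look like a path."""
    before, at, _ = name.partition('@')
    return bool(at) and not looks_like_path(before)


print(may_be_url_req('pip @ https://example.com/pip-1.3.1-py33-none-any.whl'))
print(may_be_url_req(os.path.join('some', 'dir@1.0', 'pip.whl')))
```

The first call reports a possible URL req; the second does not, because a path separator appears before the first @.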

@retpolanne
Contributor Author

retpolanne commented Feb 17, 2019

I played with Requirement a little bit and couldn't find a way to use this

URL_REQ = NAME "@" SCHEME ":" URI

It looks like it parses named requirements here, which is what is expected
NAMED_REQUIREMENT = NAME + Optional(EXTRAS) + (URL_AND_MARKER | VERSION_AND_MARKER)

But when using the URL_REQ line, I kind of break it even more 😞 . Hopefully, it looks like unnamed requirements are passed as names, and not URLs.

I played a little bit with some validations:

(Pdb) URL_TEST = Optional(AT) + URI
(Pdb) URL_TEST.parseString('https://google.com')
(['https://google.com'], {'url': ['https://google.com']})
(Pdb) URL_TEST.parseString('google @ https://google.com')
(['google'], {'url': ['google']})

@uranusjr
Member

uranusjr commented Feb 18, 2019

It occurs to me just now that we need another test case for URLs with authentication.

pip install https://user:[email protected]

Member

@cjerdonek cjerdonek left a comment

Some comments.

src/pip/_internal/req/constructors.py (seven inline review threads, since resolved)
Member

@cjerdonek cjerdonek left a comment

A couple more quick comments.

src/pip/_internal/req/constructors.py (outdated, resolved)
"Directory %r is not installable. Neither 'setup.py' "
"nor 'pyproject.toml' found." % name
)
if is_archive_file(path):
Member

You can use the "early return" pattern here again to reduce indentation by doing if not is_archive_file(path): and then returning None. Then the rest doesn't need to be indented.
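
As a rough illustration of the early-return pattern mentioned here, the sketch below shows the same logic written with nesting and with an early return. All names (is_archive_file_sketch, nested_version, early_return_version), the extension list, and the returned strings are made up for the example and are not the code under review.

```python
import os

# Illustrative subset of archive extensions.
ARCHIVE_EXTENSIONS = ('.zip', '.whl', '.tar.gz', '.tar.bz2')


def is_archive_file_sketch(path):
    return path.lower().endswith(ARCHIVE_EXTENSIONS)


def nested_version(path):
    # The main logic sits one level deeper than it needs to.
    if is_archive_file_sketch(path):
        if os.path.isfile(path):
            return 'file:' + path
        else:
            return 'missing:' + path
    return None


def early_return_version(path):
    # Handle the uninteresting case first, then continue unindented.
    if not is_archive_file_sketch(path):
        return None
    if os.path.isfile(path):
        return 'file:' + path
    return 'missing:' + path
```

Both functions return the same values; the second keeps the main path at a single indentation level.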


@patch('pip._internal.req.req_install.os.path.isdir')
@patch('pip._internal.req.req_install.os.path.isfile')
def test_get_path_to_archive_pep440_url(isdir_mock, isfile_mock):
Member

Great to see you start writing these tests! A couple comments:

First, a helpful convention: when testing a function or method named my_method, start the test function's name with test_my_method. That makes it easy to locate all the tests of a given function. So in this case, all of these should start with test_get_path_to_url_... (you don't need to include the leading underscore). If you have more than one test function for a given function, you can add a suffix describing the special case, like test_get_path_to_url__archive_pep440_url(). (I like to separate the function-name portion from the suffix with a double underscore so a reader can tell where the function name ends.)

Also, if you're testing multiple cases of a simple function, it helps to use @pytest.mark.parametrize to cut down on the amount of repetition. Take a look at test_make_vcs_requirement_url and the test functions following that for some examples. In this case, your inputs and outputs are strings (along with booleans to set your mocks), so it should be amenable to test parametrization.
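
A small sketch of both suggestions, with one substitution stated plainly: it uses the standard library's unittest subTest for the table of cases instead of @pytest.mark.parametrize, so that it runs without third-party packages. The function looks_like_path and its cases are hypothetical and stand in for the function under test.

```python
import unittest


def looks_like_path(name):
    """Hypothetical function under test."""
    return '/' in name or name.startswith('.')


class LooksLikePathTests(unittest.TestCase):

    # Named test_<function>__<suffix>: the double underscore marks where
    # the function name ends and the special-case description begins.
    def test_looks_like_path__paths(self):
        for name in ['./here', 'a/b', '.hidden']:
            with self.subTest(name=name):
                self.assertTrue(looks_like_path(name))

    def test_looks_like_path__names(self):
        for name in ['pip', 'simple-name', 'name.with.dots']:
            with self.subTest(name=name):
                self.assertFalse(looks_like_path(name))
```

With pytest, each list of inputs would become the argument list of a parametrize decorator and each loop body a plain assert.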

Member

One more comment: I like to put the test functions in the same order as the original functions appear in the module. This also makes it easier to locate test functions when you're scrolling around. The test module has a parallel structure to the module it's testing.

Contributor Author

@retpolanne retpolanne Feb 24, 2019

@cjerdonek is it ok to use a noqa on test names? Just in case they get too big.

(Turns out I didn't need it.)

Member

I don't expect they would ever get too big. You can put the arguments on the next line if it ever started to get too long.

Contributor Author

3e57673
There, I had some issues parametrizing the tests that had URLs though.

@cjerdonek
Member

@vinicyusmacedo Are you still working on your changes, or were you waiting for another review? I noticed at least one (easy) comment wasn't addressed, which is why I was waiting.

@retpolanne
Contributor Author

@cjerdonek sorry, I forgot about some of the comments. I'm pushing them right now and I think that's it :)

@cjerdonek
Member

cjerdonek commented Mar 3, 2019

@vinicyusmacedo Can you also review the pip docs to see if anything needs changing / updating? For example, there is this part from the section on Requirements Specifiers that looks like it needs to be updated:

pip does not support the url_req form of specifier at this time

Maybe you can add a paragraph after the "Since version 6.0," paragraph saying, "Since version 19.1," describing the change you're adding.

@retpolanne
Contributor Author

@cjerdonek requirements file format and examples need changing as well. Should I use Since version 19.1 as well in these parts?

@cjerdonek
Member

@vinicyusmacedo You can leave out mention of the version for now in those other sections.

@cjerdonek
Member

cjerdonek commented Mar 3, 2019

Does this mean you can also delete the parentheses here:

(pip does not support the url_req form of specifier at this time)

@cjerdonek
Member

cjerdonek commented Mar 4, 2019

@uranusjr Now that the code and tests for this PR are more in shape, and because @vinicyusmacedo followed the approach that you suggested, can you review this carefully, and also see if any test cases are missing or should be added? Like, would it be good to have any test cases anywhere with a space missing before and/or after the @ symbol to make sure it's not interpreted as a direct reference?

@pypa-bot removed the "needs rebase or merge" label (PR has conflicts with current master) May 9, 2019
@retpolanne
Contributor Author

@cjerdonek hey, sorry for bothering, is there anything that I missed on this PR?
I think that about sums up the fix.

@retpolanne
Contributor Author

Ping @uranusjr

Member

@uranusjr uranusjr left a comment

This could use some squashing, but code-wise 👍

if os.path.altsep is not None and os.path.altsep in name:
    return True
if name.startswith('.'):
    return True
Member

@uranusjr uranusjr Jul 21, 2019

I just realised this if does not work as intended, and probably should be removed. A ./whatever string would’ve been caught in previous checks. This only matters for strings like .whatever, which I guess still does look like a path…? (but then the docstring is not accurate)

Contributor Author

@uranusjr that's exactly it. I don't really know why some package would start with ., but I'll add a test case for it and add it to the docstring.

Member

Packages can’t start with a dot, so this check doesn’t really matter either way :p But it’s better to remove it since its mere existence can be confusing to future readers.

Contributor Author

Nice. I have removed it then.

Contributor Author

@uranusjr oops, actually you can use . to install a package. You can use it to install the current directory as a package if it has a setup.py file.

Contributor Author

@retpolanne retpolanne Jul 21, 2019

I still have a problem, though: the Windows tests might fail with the == since the path separator is different. I'll go with name.startswith then (I could make separate test cases for Windows, but that doesn't sound so good).

Member

I believe the sep and altsep parts cover the different separators (if my memory of implementation from other projects serves).

Contributor Author

From what I understood from the docs, it appears to be only available on Windows (the altsep on Windows would be the forward-slash)

https://docs.python.org/3/library/os.html#os.altsep
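
The separator values under discussion can be inspected on any platform through the standard library's ntpath and posixpath modules, which are what os.path resolves to on Windows and POSIX respectively. The helper has_separator below is a hypothetical name used only for this sketch.

```python
import ntpath
import posixpath

print(repr(posixpath.sep), repr(posixpath.altsep))  # '/' None
print(repr(ntpath.sep), repr(ntpath.altsep))        # '\\' '/'


def has_separator(name, pathmod):
    """Check name against sep and, when defined, altsep of pathmod."""
    if pathmod.sep in name:
        return True
    return pathmod.altsep is not None and pathmod.altsep in name


print(has_separator('relative\\path', ntpath))     # True
print(has_separator('relative/path', ntpath))      # True
print(has_separator('relative\\path', posixpath))  # False
```

Checking both sep and altsep therefore covers backslashes and forward slashes on Windows, while only the forward slash counts on POSIX.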

Member

@uranusjr uranusjr Jul 22, 2019

That is correct, hence the first test would detect \ on Windows and / on POSIX; the second test detects / on Windows (always false on POSIX).

You could add a simple Windows-only test like this, if you’re inclined to:

import os

import pytest


@pytest.mark.parametrize('path', [
    '.\\path\\to\\installable',
    'relative\\path',
    'C:\\absolute\\path',
])
@pytest.mark.skipif(os.path.sep != '\\', reason='Windows path separators only')
def test_looks_like_path_win(path):
    assert _looks_like_path(path)

Contributor Author

@uranusjr I could also skip this test if not sys.platform.startswith("win")

@xavfernandez
Member

xavfernandez commented Jul 21, 2019

Thanks for sticking with it @vinicyusmacedo 👍

@retpolanne
Contributor Author

@xavfernandez @uranusjr just added some Windows-specific tests.

@BrownTruck
Contributor

Hello!

I am an automated bot and I have noticed that this pull request is not currently able to be merged. If you are able to either merge the master branch into this pull request or rebase this pull request against master then it will be eligible for code review and hopefully merging!

@BrownTruck added the "needs rebase or merge" label (PR has conflicts with current master) Jul 27, 2019
@pypa-bot removed the "needs rebase or merge" label (PR has conflicts with current master) Jul 27, 2019
@retpolanne
Contributor Author

@xavfernandez is it possible to merge this one?

@desaintmartin

Gentle up! How can we help to get this merged?

Member

@chrahunt chrahunt left a comment

LGTM!

@chrahunt
Member

I went through the existing comments and I believe all of them are addressed, so I will merge this. If I missed anything we can always address it in a followup.

Thanks for sticking with it @vinicyusmacedo!

@chrahunt chrahunt merged commit 82c2dd4 into pypa:master Sep 16, 2019
@retpolanne
Contributor Author

@chrahunt thank you and everyone who reviewed this PR :)

@retpolanne retpolanne deleted the fix-pep-508 branch September 16, 2019 22:35
@lock bot added the "auto-locked" label (Outdated issues that have been locked by automation) Oct 16, 2019
@lock lock bot locked as resolved and limited conversation to collaborators Oct 16, 2019
Successfully merging this pull request may close these issues.

pip fails to install remote dependency when it is a .whl and follows PEP 440
8 participants