Support running DVC from a virtual environment, conda, etc #51

Closed
shcheklein opened this issue Dec 29, 2020 · 16 comments
Labels
priority-p1 Regular product backlog

Comments

@shcheklein
Member

We should probably take a look at how the Python plugin is implemented, or similar plugins for that matter.

It's important for simplifying debugging (no need to install DVC from master globally), and we'll need it anyway for a good user experience.

@rogermparent @RandomFractals @ryanraposo how do you guys run it now? What would it take to implement it? How should it look in VS Code?

@shcheklein shcheklein added the priority-p1 Regular product backlog label Dec 29, 2020
@rogermparent
Contributor

rogermparent commented Dec 29, 2020

Currently, at least in the DvcReader module that fetches table data, the extension looks for .env/bin/dvc and invokes it as the DVC binary if present. If that file isn't present, the extension simply calls dvc through the shell, which uses whatever DVC the user has on their PATH.

#28 and #26 have some existing discussion on this.

I've used the current implementation on both virtualenv and conda, and it works as long as the environment folder is .env like all existing DVC projects I've looked at describe, as both tools insert the binary in the same .env/bin/dvc location. Aside from a feature that spawns a fully provisioned terminal which would require different commands, it seems like we're able to treat both env tools similarly by executing the env's binaries directly.
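
A minimal sketch of the lookup described above, assuming a helper named resolveDvcBinary (a hypothetical name, not the extension's actual code). It prefers the binary inside the environment folder and otherwise returns the bare command so the shell resolves it through PATH. The exists parameter is only there so the check can be swapped out in tests.

```typescript
import * as fs from "fs";
import * as path from "path";

// Hypothetical helper: returns the project-local DVC binary when it exists,
// otherwise the bare command name so the shell resolves it through PATH.
export function resolveDvcBinary(
  workspaceRoot: string,
  envDir: string = ".env",
  exists: (candidate: string) => boolean = fs.existsSync
): string {
  const candidate = path.join(workspaceRoot, envDir, "bin", "dvc");
  return exists(candidate) ? candidate : "dvc";
}
```

Note that on Windows virtualenv places executables in a Scripts folder rather than bin, so the candidate path would need adjusting there.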

@ryanraposo has pointed out some APIs in the existing Python extension for VS Code that, if I understand correctly, we can lean on which would allow us to focus more on running DVC than building out a system to manage the user's Python environment.
I've seen some VSCode extensions like Live Share automatically install other extensions that they depend on, so it doesn't seem too out of the ordinary to leverage other extensions like this in our case.

@shcheklein
Member Author

.env/bin/dvc and invokes it as the DVC binary if present

Hmm, I thought that you would need the env to be activated (PYTHONPATH set properly, etc.) for this to work. I'll try, thanks Roger.

has pointed out some APIs in the existing Python extension for VS Code that, if I understand correctly, we can lean on which would allow us to focus more on running DVC than building out a system to manage the user's Python environment.

Good points! It should probably depend on the size of those extensions vs. how much code it takes to provide some basic environment support ourselves.

@RandomFractals

RandomFractals commented Dec 30, 2020

Since the current implementation of dvc data pipelines requires python scripts, I think it would make sense to list Python as a dependency for the dvc extension. This is done in the package.json extension manifest via the extensionDependencies or extensionPack elements, which basically install additional extensions, if they are not present, for your extension to function properly. In our case we'd require the vscode Python extension to be installed. See docs for extension dependencies config:

https://code.visualstudio.com/api/references/extension-manifest#extension-packs

I don't think we should mock up anything extra other than checking whether the dvc cli is installed and, if it isn't, displaying an info/warning message with a link to the dvc installation docs. That can easily be done by the extension running dvc --version on activation to verify that the dvc command-line tool is installed and that our extension is compatible with that version. It's pretty much what gitlens does when you install it: it checks the git version and prompts to install/update it.
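
A rough sketch of that activation check, assuming the output of dvc --version contains a dotted version number somewhere. The function names parseDvcVersion and detectDvcVersion are hypothetical. The second function resolves to undefined when the binary cannot be run at all, which is the case where a warning with a link to the install docs would be shown.

```typescript
import { execFile } from "child_process";

// Extracts the first "major.minor.patch" token from the text printed by
// `dvc --version`. Returns undefined when no such token is found.
export function parseDvcVersion(stdout: string): string | undefined {
  const match = /\d+\.\d+\.\d+/.exec(stdout);
  return match ? match[0] : undefined;
}

// Runs `<bin> --version` without a shell. Resolves to the parsed version, or
// to undefined when the binary is missing or exits with an error.
export function detectDvcVersion(
  bin: string = "dvc"
): Promise<string | undefined> {
  return new Promise((resolve) => {
    execFile(bin, ["--version"], (error, stdout) => {
      resolve(error ? undefined : parseDvcVersion(stdout));
    });
  });
}
```

In the extension's activate function the result could drive a call such as vscode.window.showWarningMessage when it is undefined. Which versions count as compatible is left open here.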

@RandomFractals

RandomFractals commented Dec 30, 2020

... as for the Python env. config that a dvc command might need in order to run python scripts: we should be able to get the Python settings from the vscode.workspace.getConfiguration API and provide them to the dvc cli, or use them for running that command as needed.

See Python settings doc: https://code.visualstudio.com/docs/python/settings-reference
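
As a sketch of how those settings could be used: the snippet below reads pythonPath from a configuration object and guesses that a pip-installed dvc sits next to the interpreter. That placement is an assumption, and dvcNextToInterpreter is a hypothetical helper. ConfigLike is a minimal stand-in for what vscode.workspace.getConfiguration("python") returns, so the snippet runs outside VS Code.

```typescript
import * as path from "path";

// Minimal structural stand-in for the object returned by
// vscode.workspace.getConfiguration("python"); only `get` is used here.
interface ConfigLike {
  get<T>(section: string): T | undefined;
}

// Hypothetical helper: guesses where a pip-installed dvc would live, namely
// in the same directory as the configured Python interpreter.
export function dvcNextToInterpreter(
  pythonConfig: ConfigLike
): string | undefined {
  const pythonPath = pythonConfig.get<string>("pythonPath");
  if (!pythonPath || path.basename(pythonPath) === pythonPath) {
    // Unset, or a bare command such as "python": no directory to derive.
    return undefined;
  }
  return path.join(path.dirname(pythonPath), "dvc");
}
```

Inside the extension the call would look like dvcNextToInterpreter(vscode.workspace.getConfiguration("python")). The returned path is only a candidate and still needs an existence check.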

@shcheklein is that what you were looking for?

@shcheklein
Member Author

@shcheklein is that what you were looking for?

https://code.visualstudio.com/docs/python/settings-reference

this link looks right to me (at least in the right direction). I'm still not sure exactly how the workflow/settings should look in our case (e.g. should we replicate some settings from the Python extension, or should we take it as a dependency?). I guess it's easier to start with a dependency.

Is there a settings/configuration page where I can pick the Python env I'd like to use (including DVC itself, which might be installed in multiple places)? To clarify, there are a few different cases in my mind:

  1. The project itself is a Python project. The user has to set up the Python extension, venv or conda, or whatever, even to run something from the project (no matter whether it's DVC or not). In this case DVC might be installed as part of that venv, or might be installed globally. If the venv is activated I would expect us to detect it and prioritize DVC from that env over the global one.
  2. It's not a Python project at all (it's R, or bash, or whatnot). In this case there still can be some separate venv with DVC installed (but it doesn't belong to the project), or it can be installed globally, etc. We need a way to say which DVC to use.

@RandomFractals

RandomFractals commented Dec 30, 2020

Those Python settings are accessible via the File -> Preferences -> Settings menu in vscode if you navigate to Extensions -> Python in that view.

Regardless of what language DVC & data pipeline config requires, I think it should use default or configured language settings from vscode per user preference.

If you think we should have a DVC setting for its path, we can add it to our extension. I just don't know if we need to since DVC gets added to env. path for running it on win and other platforms, and seems to work just fine.

@shcheklein
Member Author

Regardless of what language DVC & data pipeline config requires

DVC doesn't require any specific language. DVC is a command line tool (like Git). The only tricky part is that it could be installed as a global package (the Windows installer that you mention) or into a virtual env (like pip install dvc). In the latter case we should be able to find it, initialize the env to be able to run it, etc.

To be clear: I'm not a VS Code expert, so I don't know what the exact solution should look like here. But it should cover all the possible scenarios in the most convenient and predictable way for VS Code users.

E.g. if it's a Python project, virtualenv is set, VS Code itself might detect it (or Python extension). We should be able to detect if DVC is installed with pip here.

@RandomFractals

RandomFractals commented Dec 30, 2020

well, dvc.yaml has python commands. so, I think it's valid to say DVC requires Python. If you later add R or node.js data pipelines support I expect dvc.yaml to contain commands for those interpreters as well. However, as long as we can run the dvc --version check as I suggested above, we should be fine.

@shcheklein
Copy link
Member Author

well, dvc.yaml has python commands. so

They are already not specific to Python. Those could be bash, R, anything else even today. The whole project can be written in R. It's like Make in this case: it's language agnostic. The project itself defines the environment, requirements, etc.

However, as long as we can run dvc --version check as I suggested above, we should be fine.

It might get complicated if DVC is installed in conda/virtualenv. I would not expect dvc --version to run as is. Some support is required from VS Code. From what I see, it's not given "out of the box". E.g. in the current extension dvc exp run fails for me, but dvc exp list runs fine (since @rogermparent made it look into .env). Again, I don't know what the right solution is here yet, I'm just reporting symptoms.

@RandomFractals

RandomFractals commented Dec 30, 2020

I'll see what @rogermparent does for dvc exp list. Looking it up and hooking up .env is best.

In any case, I want to create a base DVCCommand class we can all use that runs these checks & wires launching dvc in one place. Should be part of #40

We might also consider creating a dialog to install DVC if we detect that it's not installed for the env. config you are trying to run it in.

@ryanraposo

ryanraposo commented Jan 2, 2021

@shcheklein regarding: conda, python configs, the user flow, and our options. This might help connect some dots.

env-demo

  • Python 'project' opened in vscode for the first time
  • Env users know to select the right interpreter (status bar item)
  • .vscode folder appears in root, specifically to store 'settings.json' (workspace-scoped configurations)
  • The setting being stored is the "python.pythonPath". This is how VS Code keeps track of envs.
  • Users are also accustomed to conda and its envs being reflected in the integrated terminals, but that would be shell-side. (cmd.exe for example). For several reasons, we can't easily go that route for detection when it comes to our command execution needs.

Like @RandomFractals mentioned, we can easily retrieve that pythonPath. I'm not sold on a particular approach either, though.

EDIT: something worth taking a second to highlight: configurations are often dynamic like this and things like UI toggles, detections, etc are often hard-wired to them; changing the settings.json files behind the scenes.

We don't ever need to look down on defining a DVC path in that .vscode/settings.json, for example, because it doesn't mean it's the whole story. We can have detection layers on top, and come away with users being able to override all of it.
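
One way to picture those layers, as a sketch only: each layer is a function that either returns a DVC path or defers, and the first answer wins. Putting the user's explicit setting first is what makes it an override. The names Detector and firstDetected are hypothetical, and the concrete layers (settings lookup, environment detection) are left to the caller.

```typescript
// Each layer returns a DVC path, or undefined to defer to the next layer.
type Detector = () => string | undefined;

// Walks the layers in priority order and returns the first path found.
// Falls back to the bare command, which the shell resolves through PATH.
export function firstDetected(
  layers: Detector[],
  fallback: string = "dvc"
): string {
  for (const detect of layers) {
    const found = detect();
    if (found) {
      return found;
    }
  }
  return fallback;
}
```

A call could then look like firstDetected([fromUserSetting, fromSelectedInterpreter, fromWorkspaceEnv]), where those three functions are placeholders for whatever detections end up being implemented.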

@shcheklein
Member Author

@ryanraposo thanks! Let me ask a few questions to keep the discussion going and understand it better.

Python 'project' opened in vscode for the first time. Env users know to select the right interpreter (status bar item).

does it happen only if Python extension is installed?

Users are also accustomed to conda and its envs being reflected in the integrated terminals

I know about conda shell side (it is similar to any other virtualenv in that sense), but what about VS code? does it detect it same way?

The setting being stored is the "python.pythonPath". This is how VS Code keeps track of envs.

let's also keep in mind that this is a path for the project. It's useful to have access to it, but in general, DVC could be installed outside, e.g. globally - we should be able to detect that.

We don't ever need to look down on defining a DVC path in that .vscode/settings.json, for example, because it doesn't mean its the whole story. We can have detection layers on top, and come away with users being able to override all of it.

Sounds good to me, and that's what I probably had in mind. The question here is in the details: can we start outlining the exact logic of that layer?

@ryanraposo

ryanraposo commented Jan 2, 2021

@ryanraposo thanks! Let me ask a few question to keep the discussion going and understand it better.

@shcheklein Of course! It'll help me check myself on my own assumptions/gaps, too.

... select the right interpreter (status bar)
does it happen only if Python extension is installed?

Yeah. Without it there won't be any detection of envs/interpreters; instead you'll get 1) a prompt to install it, which is enabled by 2) language detection for Python, which is built in.

no-python png

Users are also accustomed to conda and its envs being reflected in the integrated terminals
I know about conda shell side (it is similar to any other virtualenv in that sense), but what about VS code? does it detect it same way?

Right. So in my overview up there, I left that critical part out. And I should say: that's how VS Code keeps track of envs per project.

The setting being stored is the "python.pythonPath". This is how VS Code keeps track of envs.
let's also keep in mind that this is a path for the project. It's useful to have access to it, but in general, DVC could be installed outside, e.g. globally - we should be able to detect that.

So this is related to the part I left out. Right now, as far as I understand, the Python extension looks for:

  • Standard install paths such as /usr/local/bin, /usr/sbin, /sbin, c:\python27, c:\python36, etc.
  • Virtual environments located directly under the workspace (project) folder.
  • Virtual environments located in the folder identified by the python.venvPath setting (see General settings), which can contain multiple virtual environments. The extension looks for virtual environments in the first-level subfolders of venvPath.
  • Virtual environments located in a ~/.virtualenvs folder for virtualenvwrapper.
  • Interpreters installed by pyenv.
  • Virtual environments located in the path identified by WORKON_HOME (as used by virtualenvwrapper).
  • Conda environments that contain a Python interpreter. VS Code does not show conda environments that don't contain an interpreter.
  • Interpreters installed in a .direnv folder for direnv under the workspace (project) folder.

It's very capable, but yeah, this is all a means to the end of detecting DVC, and it can be installed independent of python/pip (right?)

We don't ever need to look down on defining a DVC path in that .vscode/settings.json, for example, because it doesn't mean its the whole story. We can have detection layers on top, and come away with users being able to override all of it.
sounds good to me, and that's what I probably had in mind. The question here in the details - can we start outlining exact logic of that layer?

I'm not able to include it here right this second--but I'll do that. It would be nice to have something to poke holes in, especially with your input!

EDIT: I've pushed some code (#55 on dvc-detect-select) up to the point of detection so that we end up with a low-level plan. I want to merge it ultimately so that we can do a separate issue re: detections as methods.

@mattseddon
Member

mattseddon commented Jan 22, 2021

I ran head first into this problem when trying to set up the demo/example-get-started sub-project to test out and have a look around the extension's codebase. Clicking the Run Experiment button within the View Tree window of the extension (in debug mode) was producing no extra results and a lot of errors in the debug console.

The initial error was:

> python src/featurization.py data/prepared data/features
Traceback (most recent call last):
  File "/Users/mattseddon/PP/vscode-dvc/demo/example-get-started/src/featurization.py", line 3, in <module>
    import pandas as pd
ModuleNotFoundError: No module named 'pandas'

This error occurs because pandas is not a direct dependency of dvc but is a dependency of the demo project. The error persisted after enabling the python extension (ms-python.python) and setting the correct venv (as discussed at length above). From debugging, it appears that this is because execCommand (shown below) has no way to directly hook into the python environment that is set up by the python extension.

const execCommand: (
  options: DVCExtensionOptions,
  command: string
) => Promise<{ stdout: string; stderr: string }> = ({ bin, cwd }, command) =>
  execPromise(`${bin} ${command}`, {
    cwd
  })

I confirmed this by adding source .env/bin/activate && to the function. After doing this everything started to work.
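
For illustration, the change amounts to something like the sketch below. withActivation is a hypothetical helper, not the code that was actually committed, and it assumes the activate script path contains no spaces.

```typescript
// Hypothetical helper: prefixes a shell command with virtualenv activation
// when an activate script is given, and leaves it unchanged otherwise.
export function withActivation(
  command: string,
  activateScript?: string
): string {
  return activateScript
    ? `source ${activateScript} && ${command}`
    : command;
}
```

With this, execCommand would run withActivation(`${bin} ${command}`, ".env/bin/activate") instead of the bare string. One caveat: child_process.exec runs commands through /bin/sh on Unix by default, and source is a bash builtin that some sh implementations lack. Passing the shell option (for example shell: "/bin/bash") or using the POSIX `.` command instead avoids that.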

Sidebar: I then ran into another issue caused by dvc's Dulwich dependency (jelmer/dulwich#793) and contributed a fix which is now in master but not yet released.

The above solution is a very narrow use case within a huge range of possibilities for the project being setup.

I investigated hooking into the python extension and the functions that it exposes, but they are very limited. The most useful option that I could find is python.createTerminal, which creates a terminal and automatically activates the environment that you've selected. We can then pass commands to that terminal using sendText, but there is no easy way to track and parse the stdout provided by the terminal. We would have to go down the route of using the insiders version of VS Code and the TerminalDataWriteEvent event in the proposed API. This would be a project in itself and I don't think that is a viable approach for an MVP. The reason that I mention this is that it solves the Run Experiment issue above but would not help us with running commands like dvc exp show --show-json (as we cannot get hold of the output).

Here is a screen recording of a working prototype:

Screen.Recording.2021-01-22.at.10.29.04.am.mov

After re-reading the thread and understanding that the bulk of dvc cli functions are standalone perhaps we could simply split out the ones that can be executed standalone from ones that can't and execute those behind the scenes.

This would also mean that we could easily and quickly make all of the functions shown in #40 available in the command palette and have them passed to a terminal, they should work as expected, give the user a feel for what they would see in the terminal if using the CLI and get them familiar with the mapping of "words to commands":
image

I do also understand that the underlying project could be in any language with its own environment, which pushes me towards thinking that this only solves part of a much larger problem. If we break the problem down, identify and prioritise the core languages that we want to cater for, and then execute on each, we should be ok. Perhaps each language we want to support has a similar extension to ms-python.python which we can piggyback on.

Is there a list of core languages that we want to support? Is the initial use case Python projects only?

One other option would be to open source / provide a vscode devcontainer per language and install all dependencies globally within the container. That would make a lot of these problems go away but brings in docker (and user knowledge of docker) as a dependency. I do have experience with this but am not sure how big the appetite for / adoption of such an involved solution would be in your wider community / target audience. Keen to know everyone's thoughts on this. We would be able to include the extension as part of the devcontainer.json setup and know exactly where to expect everything to be installed. Would this be something that we would even consider?

Apologies for the long post and thanks for reading. Happy to answer any follow up questions that anyone has.

Matt

@mattseddon
Member

Here is a screen recording of a basic miniconda environment being activated throughout our test suite:

Screen.Recording.2021-02-05.at.2.51.51.pm.mov

This shows that the built functionality holds true for both venv and conda environments.

@shcheklein would you be happy for me to close this one off now?

@shcheklein
Copy link
Member Author

This is perfect. Sure, let's close this. Thanks @mattseddon !
