Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve Python environment activation #1791

Closed
mattseddon opened this issue May 30, 2022 · 18 comments · Fixed by #4045
Closed

Improve Python environment activation #1791

mattseddon opened this issue May 30, 2022 · 18 comments · Fixed by #4045
Assignees
Labels
A: integration Area: DVC integration layer A: onboarding Improving and simplifying users happy path. How do we get them have value asap? enhancement New feature or request priority-p1 Regular product backlog research

Comments

@mattseddon
Copy link
Member

mattseddon commented May 30, 2022

Currently for Python virtual environments we use ms-python.python's API to grab the currently selected Python interpreter and then we run:

$python_bin -c "import sys; print(sys.executable)"

This gives us the absolute path of the interpreter. We then call dvc with path/to/interpreter/python -m dvc with some args. We also prepend the directory that the python interpreter is in to the system $PATH variable before running the command. This works for the basic case of DVC being installed within the virtual environment and is the suggested approach from an ongoing github discussion.

Our current approach also works for global installs of DVC as long as the required dependencies are installed within the virtual environment. E.g for the demo project a global install of DVC does not work unless dvclive and ruamel.yaml are added to the virtual environment. I don't think there is a way around this for the brew installation. I looked into what can be done for a global install via pip but I can recreate the same behaviour via the terminal after activating the virtual environment.

I've been looking into how other extensions do this again and have found the following issues/discussions:

This seems to be an area that the vscode-python team is actively working on. We should keep an eye on those discussion tickets and see where they go. If they could expose the activation command in their API that would be ideal. Maybe we can get the correct PATH details from EnvironmentVariableCollection in the extension context.

@mattseddon
Copy link
Member Author

How this currently works

There are several options available to the user. We cater to the following combinations:

DVC is installed has virtual environment user wants to rely on ms-python.python for interpreter path internal dvc path internal python path call CLI with
inside environment yes yes dvc result of command*** /python/path -m dvc exp run
inside environment yes no dvc set by user /user/provided/path -m dvc exp run
global yes yes dvc result of command*** dvc exp run
global yes no dvc set by user dvc exp run
global no no dvc null dvc exp run
global yes yes user defined result of command*** /user/defined/path/dvc exp run
global yes no user defined set by user /user/defined/path/dvc exp run
global no no user defined null /user/defined/path/dvc exp run

*** /ms-python/provided/path/to/python -c 'import sys; print(sys.executable)'

When the CLI is installed globally it could actually be a binary so cannot call with /path/to/python -m dvc. When there is a virtual environment involved we work around this by taking the directory from the path to the interpreter and prepending it to the PATH environment variable.

Coverage

This works for normal basic use-cases of:

  • virtualenv
  • venv
  • pipenv
  • conda

Problems

However, this approach falls down when there is an activation script available (e.g not conda) and a user has added extra environment variables to that script and/or the package manager has its own special way of setting environment variables e.g conda's environment variables or pipenv's automatic loading of .env files.

The problem is not as simple as "just source the activation script" for the reasons stated above and because not all package managers have an activate script (e.g conda).

Proposed solution

I personally think the best thing to do right now would be to give the user the option to set their own additional environment variables for a DVC process (i.e anything that can require packages from a virtual environment). This does add to the setup overhead and will need to be documented BUT should only have to happen once per project per user. I think this would be a fair trade off and we may eventually be rescued by ms-python.python if they improve their API.

VS Code already let's users set these type of config variables for the integrated terminal:

image

WDYT @shcheklein?

@shcheklein
Copy link
Member

@mattseddon qq - how do we get the MS Python's interpreter path? Could you share what API doc / place in out code?

@mattseddon
Copy link
Member Author

mattseddon commented Jun 6, 2022

We use this API from the python extension: https://github.com/microsoft/vscode-python/blob/main/src/client/api.ts
We pull these details into our config: extension/src/config.ts through extension/src/extensions/python.ts

The part of the API that we use is this:

    const api: IExtensionApi = {
        // 'ready' will propagate the exception, but we must log it here first.
        ready: ready.catch((ex) => {
            traceError('Failure during activation.', ex);
            return Promise.reject(ex);
        }),

        ...

        settings: {
            onDidChangeExecutionDetails: interpreterService.onDidChangeInterpreterConfiguration,
            getExecutionDetails(resource?: Resource) {
                const pythonPath = configurationService.getSettings(resource).pythonPath;
                // If pythonPath equals an empty string, no interpreter is set.
                return { execCommand: pythonPath === '' ? undefined : [pythonPath] };
            },
        },

@shcheklein
Copy link
Member

@mattseddon thanks, have you found/checked by chance how does VS Code Terminal "understand" which script to run (how does communicate with MS Python)?

@mattseddon
Copy link
Member Author

mattseddon commented Jun 6, 2022

It is the other way around. MS Python sends messages to the terminal with sendText, it listens to events from the window to know when a new terminal is created.

These parts of the VS Code API that they use:

    /**
     * An individual terminal instance within the integrated terminal.
     */
    export interface Terminal {
    ...
        /**
         * Send text to the terminal. The text is written to the stdin of the underlying pty process
         * (shell) of the terminal.
         *
         * @param text The text to send.
         * @param addNewLine Whether to add a new line to the text being sent, this is normally
         * required to run a command in the terminal. The character(s) added are \n or \r\n
         * depending on the platform. This defaults to `true`.
         */
        sendText(text: string, addNewLine?: boolean): void;

        ...

        /**
         * An {@link Event} which fires when a terminal has been created, either through the
         * {@link window.createTerminal createTerminal} API or commands.
         */
        export const onDidOpenTerminal: Event<Terminal>;

    /**
     * Namespace for dealing with the current window of the editor. That is visible
     * and active editors, as well as, UI elements to show messages, selections, and
     * asking for user input.
     */
    export namespace window {

    ...

        /**
         * An {@link Event} which fires when a terminal has been created, either through the
         * {@link window.createTerminal createTerminal} API or commands.
         */
        export const onDidOpenTerminal: Event<Terminal>;

        ...

Finding anything in the vscode-python code base is difficult because all of the types are interchangeable interfaces and it does a lot of DI injection. src/client/terminals/activation.ts has a mention of onDidOpenTerminal which does lead to the window: https://github.com/microsoft/vscode-python/blob/0874c7815c6bf483d5826635c8eb45b3ab761365/src/client/terminals/activation.ts#L41-L42 but that could be a dead end.

@mattseddon
Copy link
Member Author

This could potentially change as a result of microsoft/vscode-python#11039.

@mattseddon
Copy link
Member Author

Regarding getting more information from ms-python.python: microsoft/vscode-python#18888 (reply in thread)

@shcheklein
Copy link
Member

Thanks, Matt, I think I have a better picture of all this stuff in my head now.

I personally think the best thing to do right now would be to give the user the option to set their own additional environment variables for a DVC process

It seems to be a reasonable approach. Nothing else besides just copy pasting the whole activation logic comes to my mind.

Some additional thoughts / ideas that I had but that would require going too deep.

  • can we listen to the same events as Terminal does
  • can we copy the activate logic with some reasonable overhead (we don't need the whole piece - it already identifies the environment, etc - that might simplify it for us)
  • in the terminal that runs when we tigger an experiment - can we write some additional information / hints if error happens

@shcheklein shcheklein added research A: integration Area: DVC integration layer labels Jun 8, 2022
@shcheklein shcheklein added the blocked Issue or pull request blocked due to other dependencies or issues label Oct 4, 2022
@erykml
Copy link

erykml commented Nov 12, 2022

Hi! I was directed here by @shcheklein. I am running into a problem with the DVC extension and virtual envs.

I have a codebase, and set up a dvc.yaml pipeline, which works. I tried dvc repro command (with the activated conda env) and everything succeeded. Then, I wanted to run an experiment and used dvc exp run, which also succeeded. Unfortunately, that is not the case when using the VS Code extension.

Then, I am getting an error ModuleNotFoundError: No module named 'src', which is weird as the codebase did not change between repro and exp run.

A potential culprit might be the pythonpath setting. In VS Code, I have an .env file with PYTHONPATH=src and in settings.json I have the following: "python.envFile": "${workspaceFolder}/.env". I added is as a suggestion from one of the users of DVC discord chat, but unfortunately it did not solve the problem.

To sum up, I am using macOS and working with a conda environment.

Any ideas about what might be the source of the issue and how to solve it?

EDIT:

I also tried to add the path to the Python interpreter into the settings.json file as follows:

{
    "dvc.pythonPath": "/opt/homebrew/Caskroom/miniforge/base/envs/dvc_test_2/bin/python"
}

@erykml
Copy link

erykml commented Nov 12, 2022

Another related issue that pops up while playing around with the extension is that once I try using the experiments tab (it does not work as mentioned above) to run an experiment, I can't run them anymore from the terminal using dvc exp run. I need to run export PYTHONPATH=$PWD in order to be able to run the experiments from the terminal again. This hints that the extension is unsuccessfully trying to set the path.

@shcheklein
Copy link
Member

@erykml

I can't run them anymore from the terminal using dvc exp run

Does it depend on whether extension is active or not?

Could you also please share the project layout and how do you run your script usually? I'm curios why do we need to do PYTHONPATH at all tbh. There are ways to do something like python train.py w/o providing PYTHONPATH.

@erykml
Copy link

erykml commented Nov 13, 2022

@shcheklein

After I tried running the dvc exp run command in the integrated terminal of VS Code after the failed run from the extension, then I am getting the error. I have killed the extensions terminal window by then. Does this answer your question?

The project's structure is as follows:

📦project_name
 ┣ 📂.dvc
 ┣ 📂data
 ┃ ┣ 📂processed
 ┃ ┣ 📂raw
 ┃ ┣ 📜.gitignore
 ┃ ┗ 📜raw.dvc
 ┣ 📂models
 ┣ 📂src
 ┃ ┣ 📂stages
 ┃ ┃ ┣ 📜data_split.py
 ┃ ┃ ┣ 📜eval.py
 ┃ ┃ ┗ 📜train.py
 ┃ ┗ 📂utils
 ┃ ┃ ┣ 📜__init__.py
 ┃ ┃ ┗ 📜params.py
 ┣ 📜.dvcignore
 ┣ 📜.gitignore
 ┣ 📜README.md
 ┣ 📜dvc.lock
 ┣ 📜dvc.yaml
 ┣ 📜metrics.json
 ┣ 📜params.yaml
 ┗ 📜requirements.txt

And normally I run a script as python src/stages/train.py.

@shcheklein
Copy link
Member

Got it.

This hints that the extension is unsuccessfully trying to set the path.

This is interesting. We'll check.

And normally I run a script as python src/stages/train.py.

thanks, it makes sense. Why do you need to use PYTHONPATH? I think you can import modules / scripts properly in the train.py so that PYTHONPATH is not required?

@erykml
Copy link

erykml commented Nov 13, 2022

Currently, I import the modules/scripts in the train.py as follows:

import pandas as pd
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier

from src.utils.params import load_params

@shcheklein
Copy link
Member

@SoyGema could you please also add a summary of the road block that you hit with the env vars / PYTHONPATH?

@SoyGema
Copy link

SoyGema commented Nov 21, 2022

CONTEXT : Reproducing the experiments for workshop , with dvc.yaml and params.yaml .

When executing dvc exp run train , the CLI is throwing the error:

Running stage 'train':                                                
> python  ExperimentsDVC/src/models/train_model.py
Traceback (most recent call last):
  File "/path/ExperimentsDVC/src/models/train_model.py", line 2, in <module>
    from src.data.load_dataset import read_data
ModuleNotFoundError: No module named 'src'
ERROR: failed to reproduce 'train': failed to run: python  ExperimentsDVC/src/models/train_model.py, exited with 1

That leads to set PYTHONPATH in CLI with export PYTHONPATH = 'path/ExperimentsDVC'
With this, dvc experiments is able to take the absolute path and train the experiment , showing metrics in table.

Then, pressing left button and executing Modify and Run UI feature , and changing some params , the following message is shown.

>>>> DVC Terminal >>>>

Running: dvc exp run -S params.yaml:train.n_estimators=150 -S params.yaml:train.max_depth=25 -S params.yaml:train.min_samples_split=10 -S params.yaml:load.data_path_train=ExperimentsDVC/data/raw/jan_train.csv -S params.yaml:load.data_path_test=ExperimentsDVC/data/raw/jan_test.csv -S params.yaml:featureselection.data=ExperimentsDVC/data/raw/jan_train.csv -S params.yaml:featureselection.features=["id","sat_id","x_sim","y_sim","z_sim","Vx_sim","Vy_sim","Vz_sim","Day","Hour","Minute"] -S params.yaml:featureselection.labels=["x","y","z","Vx","Vy","Vz"] -S params.yaml:splitdata.size=0.33 -S params.yaml:splitdata.state=42 -S params.yaml:standardscaler.data=null -S params.yaml:train.min_impurity_decrease=1

Running stage 'train':
> python  ExperimentsDVC/src/models/train_model.py
Traceback (most recent call last):
  File "/path/VSCode-DVC-Workshop/ExperimentsDVC/src/models/train_model.py", line 2, in <module>
    from src.data.load_dataset import read_data
ModuleNotFoundError: No module named 'src'
ERROR: failed to reproduce 'train': failed to run: python  ExperimentsDVC/src/models/train_model.py, exited with 1

Which seems similar to the previous error , which might lead to "assume" or "hypothesize" that the Workspace UI path config doesn´t relate to defined PYTHONPATH in CLI

CONFIG
Virtual env with requirements . Setup Workspace with DVC:Setup The Workspace > Auto . Bottom Left DVC Circle UI complete and bottom right Python seems correct 3.9.13('env': venv)

@mattseddon
Copy link
Member Author

Looks like PYTHONPATH is now exposed by the Python Extension: https://github.com/microsoft/vscode-python/blob/main/src/client/api.ts#L80

@shcheklein shcheklein added A: onboarding Improving and simplifying users happy path. How do we get them have value asap? priority-p1 Regular product backlog triage and removed blocked Issue or pull request blocked due to other dependencies or issues labels Dec 29, 2022
@mattseddon
Copy link
Member Author

Looks like PYTHONPATH is now exposed by the Python Extension: https://github.com/microsoft/vscode-python/blob/main/src/client/api.ts#L80

This is however marked as deprecated so we need to wait for it to get moved into stable before we go ahead and use it: https://github.com/microsoft/vscode-python/blob/main/src/client/api.ts#L30

@mattseddon mattseddon removed the triage label Jan 3, 2023
@mattseddon mattseddon added the blocked Issue or pull request blocked due to other dependencies or issues label Jan 3, 2023
@mattseddon mattseddon added this to To do in VS Code 31 Jan via automation Jan 31, 2023
@mattseddon mattseddon moved this from To do to In progress in VS Code 31 Jan Feb 2, 2023
@mattseddon mattseddon moved this from In progress to To do in VS Code 31 Jan Feb 2, 2023
@mattseddon mattseddon removed this from To do in VS Code 31 Jan Feb 2, 2023
@mattseddon mattseddon removed the blocked Issue or pull request blocked due to other dependencies or issues label Mar 21, 2023
@mattseddon mattseddon self-assigned this Jun 6, 2023
@mattseddon mattseddon added this to To do in VS Code May 23 via automation Jun 6, 2023
@mattseddon mattseddon moved this from To do to In progress in VS Code May 23 Jun 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: integration Area: DVC integration layer A: onboarding Improving and simplifying users happy path. How do we get them have value asap? enhancement New feature or request priority-p1 Regular product backlog research
Projects
No open projects
VS Code May 23
In progress
4 participants