Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nvidia-container-toolkit: only mount existing paths in the host #319772

Open
wants to merge 7 commits into
base: master
Choose a base branch
from

Conversation

ereslibre
Copy link
Member

Description of changes

Although the final fix for this kind of issues is #290609 (unbreak: we're currently mounting incomplete closures; need to use exportReferencesGraph.), this adds yet another fix for a symptom.

Fixes: #319201

I will address the exportReferencesGraph soon that I expect will fix all these problems once and for all, but fix this problem in the meantime.

Things done

  • Built on platform(s)
    • x86_64-linux
    • aarch64-linux
    • x86_64-darwin
    • aarch64-darwin
  • For non-Linux: Is sandboxing enabled in nix.conf? (See Nix manual)
    • sandbox = relaxed
    • sandbox = true
  • Tested, as applicable:
  • Tested compilation of all packages that depend on this change using nix-shell -p nixpkgs-review --run "nixpkgs-review rev HEAD". Note: all changes have to be committed, also see nixpkgs-review usage
  • Tested basic functionality of all binary files (usually in ./result/bin/)
  • 24.11 Release Notes (or backporting 23.11 and 24.05 Release notes)
    • (Package updates) Added a release notes entry if the change is major or breaking
    • (Module updates) Added a release notes entry if the change is significant
    • (Module addition) Added a release notes entry if adding a new NixOS module
  • Fits CONTRIBUTING.md.

Add a 👍 reaction to pull requests you find important.

@SomeoneSerge
Copy link
Contributor

@ereslibre may I ask you to instead consider whether it's possible to write a NixOS test covering all these symptoms we've had to address so far? E.g. maybe we could manually prepare a fake CDI json, and create temporary fake files in place of device nodes and userspace drivers, and then run a podman/docker container with a script ensuring that all of these fake paths are mounted correctly, including referential integrity (readlink -f /run/opengl-driver/lib/libcuda.so, etc)

This would be very useful for our migration to exportReferencesGraph...

@ereslibre
Copy link
Member Author

@ereslibre may I ask you to instead consider whether it's possible to write a NixOS test covering all these symptoms we've had to address so far?

I think this is a good idea. I have never written NixOS tests but I will take a look at existing ones and update the PR.

@ereslibre ereslibre marked this pull request as draft June 14, 2024 13:32
@SomeoneSerge
Copy link
Contributor

@ereslibre actually, I now think I shouldn't have delayed merging this; we'll still need the tests though

@ereslibre
Copy link
Member Author

@ereslibre actually, I now think I shouldn't have delayed merging this; we'll still need the tests though

Tests are WIP (ereslibre/nixpkgs@nvidia-ctk-mount-existing-paths-only...ereslibre:nixpkgs:nvidia-ctk-mount-existing-paths-only-wip); I'd like to finish them before 29th june. Otherwise I fear I'll have to resume second half of July. The "referential integrity" at this time is pretty "poor man"; not sure if you had any other approach in mind.

@ereslibre ereslibre force-pushed the nvidia-ctk-mount-existing-paths-only branch 2 times, most recently from e86c013 to 8d52c10 Compare August 16, 2024 18:37
@@ -32,6 +35,18 @@ function cdiGenerate {
--nvidia-ctk-path ${lib.getExe' nvidia-container-toolkit "nvidia-ctk"}
}

cdiGenerate | \
${lib.concatStringsSep " | " allJqMounts} > $RUNTIME_DIRECTORY/nvidia-container-toolkit.json
function additionalMount {
Copy link
Contributor

@SomeoneSerge SomeoneSerge Aug 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well one comment is that this could be rewritten with pythonMinimal or a similar language that doesn't try so much to make this painful (you could remove concatMapStringsSep and just export a json for the script)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But as commented before, if you say this is the current iteration I'll go with it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I'll work on a reimplementation of this logic with Python in future PR's, if that's fine with you.

@ereslibre ereslibre force-pushed the nvidia-ctk-mount-existing-paths-only branch from d223970 to 8e19362 Compare August 19, 2024 16:01
@ereslibre ereslibre force-pushed the nvidia-ctk-mount-existing-paths-only branch 2 times, most recently from b3ac4d8 to 9d382b5 Compare August 19, 2024 17:37
@ereslibre ereslibre marked this pull request as ready for review August 19, 2024 17:37
@ereslibre ereslibre force-pushed the nvidia-ctk-mount-existing-paths-only branch from 9d382b5 to d665ca4 Compare August 19, 2024 17:40
@ereslibre
Copy link
Member Author

@SomeoneSerge: is this good to merge on its current status?

Comment on lines 110 to 111
open = true;
package = config.boot.kernelPackages.nvidiaPackages.stable.open;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ehhh it's stupid that open doesn't change the default to... an open driver

open = true;
package = config.boot.kernelPackages.nvidiaPackages.stable.open;
};
opengl.enable = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why Ofborg isn't complaining, we have a mkRenamedModule for this one

@SomeoneSerge
Copy link
Contributor

I pushed a few cosmetic changes, tell me if they're ok by you

@ereslibre
Copy link
Member Author

I pushed a few cosmetic changes, tell me if they're ok by you

Thank you @SomeoneSerge! LGTM :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

nvidia-container-toolkit-cdi-generator should not mount nvidia-powerd when nvidia driver == 470
2 participants