Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

glib: disable use of close_range by spawn #112473

Closed
wants to merge 1 commit into from

Conversation

danielnachun
Copy link
Member

  • Have you followed the guidelines for contributing?
  • Have you ensured that your commits follow the commit style guide?
  • Have you checked that there aren't other open pull requests for the same formula update/change?
  • Have you built your formula locally with brew install --build-from-source <formula>, where <formula> is the name of the formula you're submitting?
  • Is your test running fine brew test <formula>, where <formula> is the name of the formula you're submitting?
  • Does your build pass brew audit --strict <formula> (after doing brew install --build-from-source <formula>)? If this is a new formula, does it pass brew audit --new <formula>?

I was looking more into the usage of close_range by glib and noticed that it is only used by gspawn, and its usage is gated by the HAVE_CLOSE_RANGE macro there. This is done because close_range was only recently added to glibc so many systems still don't support it. The code gracefully falls back to using older less elegant methods when close_range is unavailable (see https://gitlab.gnome.org/GNOME/glib/-/issues/2580).

I've confirmed through local testing that webkitgtk, which reproducibly fails because the close_range syscall is blocked in Docker, builds with no issues once glib has been rebuilt with this patch. To test this, I have added a temporary revision bump to webkitgtk so it is rebuilt. This should tell us if the fix worked as it should fail if it didn't.

@danielnachun danielnachun added the CI-no-fail-fast Continue CI tests despite failing GitHub Actions matrix builds. label Oct 6, 2022
@danielnachun danielnachun marked this pull request as draft October 6, 2022 06:51
@danielnachun
Copy link
Member Author

Although this timed out during dependent testing, as expected webkitgtk built successfully! I will drop the webkitgtk revision bump since it's not actually needed and see if it can finish the dependent testing in < 6 hours on Linux to make sure we didn't break anything there. The final run will need a long timeout on macOS to finish dependent testing there, but that's not worth doing until Linux looks good.

@BrewTestBot BrewTestBot added the python Python use is a significant feature of the PR or issue label Oct 6, 2022
@cho-m cho-m added long build Set a long timeout for formula testing CI-long-timeout [DEPRECATED] Use longer GitHub Actions CI timeout. CI-linux-self-hosted Build on Linux self-hosted runner labels Oct 7, 2022
@danielnachun
Copy link
Member Author

Linux-specific failures were all pre-existing, and mostly because of the missing qt bottle. The failures seen on all platforms are also pre-existing and unrelated.

@danielnachun
Copy link
Member Author

Importantly, we're not seeing any of the permissions failures we previously saw before, so this does seem like a good workaround for this until the Docker bug is fixed.

@cho-m cho-m removed the CI-long-timeout [DEPRECATED] Use longer GitHub Actions CI timeout. label Oct 8, 2022
@cho-m cho-m mentioned this pull request Nov 5, 2022
6 tasks
@github-actions
Copy link
Contributor

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@github-actions github-actions bot added the stale No recent activity label Nov 15, 2022
@github-actions github-actions bot closed this Nov 23, 2022
@chenrui333
Copy link
Member

Why didn't we pursue this?

@chenrui333 chenrui333 reopened this Nov 29, 2022
@chenrui333
Copy link
Member

@danielnachun can you also link the docker upstream issue tracker in here?

@chenrui333 chenrui333 added CI-long-timeout [DEPRECATED] Use longer GitHub Actions CI timeout. and removed stale No recent activity labels Nov 29, 2022
@chenrui333 chenrui333 marked this pull request as ready for review November 29, 2022 13:53
@chenrui333
Copy link
Member

rebased to the latest master and kicked a new run

@danielnachun
Copy link
Member Author

For reference, these two issues at least are directly relevant:
moby/moby#43595
moby/moby#42871

The long term solution that has not been merged yet is the second one which will by default return ENOSYS (which means the call is not available) when a syscall isn't yet supported by Docker's libseccomp profile instead of EPERM (which means the call is not permitted).

The reason this will fix these kinds of problems is that glibc normally never returns EPERM for "safe" operations like close_range, so user space software does not expect to receive that signal. Instead they check for an ENOSYS at run time to determine if the glibc being linked at run time supports that syscall and if that is no the case, they fall back to an older method that is supported.

I believe the current blocker to merging that is that there are some syscalls which are blocked for security reasons by default even if they are supported by glibc, and they need a way to conditionally differentiate the two scenarios which apparently is non-trivial to implement.

@Bo98
Copy link
Member

Bo98 commented Nov 29, 2022

For reference, these two issues at least are directly relevant:
moby/moby#43595

So the problem is not Docker but the seccomp installation on the host?

Is this only reproducible on linux-self-hosted-1? I can fix that to use 22.04.

@Bo98
Copy link
Member

Bo98 commented Nov 29, 2022

Homebrew/actions#320

@danielnachun
Copy link
Member Author

So the problem is not Docker but the seccomp installation on the host?

Is this only reproducible on linux-self-hosted-1? I can fix that to use 22.04.

It would be worth trying that, I didn't realize that we hadn't upgraded our host image there to 22.04. Do we know if that is the real bare metal host OS? My worry here is that it goes all way down to the bare metal host and I don't know if we can control that. But it would be fantastic if that fixed this!

@danielnachun
Copy link
Member Author

One important thing I just noticed based on this discussion - if you are using colima to run Docker locally, the Alpine host image is still using libseccomp 2.5.1 and seems not to have had close_range support backported. So the failure will occur when building locally but may not reflect what happens in CI, since it depends on the host libseccomp and not the one in the Docker image.

@danielnachun danielnachun added the automerge-skip `brew pr-automerge` will skip this pull request label Nov 29, 2022
@chenrui333 chenrui333 removed the CI-long-timeout [DEPRECATED] Use longer GitHub Actions CI timeout. label Nov 30, 2022
@cho-m
Copy link
Member

cho-m commented Nov 30, 2022

Homebrew/actions#320

Let's try running other CI-linux-self-hosted PRs to see if this worked. I am currently re-running a PR that modifies bzip (#113581) which had hit a number of related failures. Can also re-run the other glib PR #114049.

I think there are still some other self-hosted failures that are unrelated to this which would be nice to clean up. At the least some look SIMD related, but the self-hosted is now using Zen CPU which seems like it would be sufficient (previously I think I saw Broadwell or some similar Intel CPU).

@danielnachun
Copy link
Member Author

We should try it with the other glib because the patch here will obscure if we've actually fixed the problem.

@cho-m
Copy link
Member

cho-m commented Dec 1, 2022

At least in some other PRs, the errors that looked related to this (e.g. errors like Failed to close file descriptor for child process (Operation not permitted) for vala and bookloupe) seem to have gone away.

We will need to wait for long timeout slot to run other glib PR.

@cho-m
Copy link
Member

cho-m commented Dec 5, 2022

glib run looked good. Only a single error for fourstore related to #109825. https://github.com/Homebrew/homebrew-core/actions/runs/3616982087/jobs/6095493544

So this PR can probably be closed.

@cho-m cho-m added the superseded PR was replaced by another PR label Dec 5, 2022
@danielnachun
Copy link
Member Author

Yes, huge thanks for @Bo98 for noticing that the host image for our self-hosted runner was still 20.04. We'll need to add in our documentation that the host image for the self-hosted runner must be the same Ubuntu version or newer of the CI image we run inside of it.

@danielnachun danielnachun deleted the glib branch December 5, 2022 20:03
@github-actions github-actions bot added the outdated PR was locked due to age label Jan 5, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 5, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
automerge-skip `brew pr-automerge` will skip this pull request CI-linux-self-hosted Build on Linux self-hosted runner CI-no-fail-fast Continue CI tests despite failing GitHub Actions matrix builds. long build Set a long timeout for formula testing outdated PR was locked due to age python Python use is a significant feature of the PR or issue superseded PR was replaced by another PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants