[Backport 2.3-maintenance] Fix "unexpected EOF" errors on macOS #9495

Draft
wants to merge 1 commit into
base: 2.3-maintenance
from

Conversation

Ericson2314
Member

Motivation

Backport of legendary bugfix #8049 to 2.3.

Doesn't work yet; I am not sure why.

Context

This is needed so I can do #5650, which backports a number of test suite improvements that make it more robust and also more likely to catch the bugs that this catches. That, in turn, is needed so we can do better cross-version daemon testing to ensure that protocol backwards compatibility works.

Priorities

Add 👍 to pull requests you find important.

Hopefully this fixes "unexpected EOF" failures on macOS
(#3137, #3605, #7242, #7702).

The problem appears to be that under some circumstances, macOS
discards the output written to the slave side of the
pseudoterminal. Hence the parent never sees the "sandbox initialized"
message from the child, even though it succeeded. The conditions are:

* The child finishes very quickly. That's why this bug is likely to
  trigger in nix-env tests, since that uses a builtin builder. Adding
  a short sleep before the child exits makes the problem go away.

* The parent has closed its duplicate of the slave file
  descriptor. This shouldn't matter, since the child has a duplicate
  as well, but it does. E.g. moving the close to the bottom of
  startBuilder() makes the problem go away. However, that's not a
  solution because it would make Nix hang if the child dies before
  sending the "sandbox initialized" message.

* The system is under high load. E.g. "make installcheck -j16" makes
  the issue pretty reproducible, while it's very rare under "make
  installcheck -j1".

As a fix/workaround, we now open the pseudoterminal slave in the
child, rather than the parent. This removes the second condition
(i.e. the parent no longer needs to close the slave fd) and I haven't
been able to reproduce the "unexpected EOF" with this.
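
For readers unfamiliar with the pattern, here is a minimal sketch of what "open the slave in the child" can look like. It is not the code from the commit. The function name `handshakeViaChildOpenedSlave`, the literal message bytes, and the 5-second poll timeout are assumptions made for illustration. The parent only ever holds the master fd, the child opens the slave by name and exits immediately after writing, and the parent reads one line from the master.

```cpp
#include <cstdlib>
#include <fcntl.h>
#include <optional>
#include <poll.h>
#include <string>
#include <sys/wait.h>
#include <termios.h>
#include <unistd.h>

// Hypothetical helper (not Nix's actual startBuilder()): returns the first
// line the child wrote to the pty, or std::nullopt if setup or the child failed.
std::optional<std::string> handshakeViaChildOpenedSlave()
{
    // Parent: open only the master side and look up the slave's path.
    int master = posix_openpt(O_RDWR | O_NOCTTY);
    if (master == -1) return std::nullopt;
    if (grantpt(master) == -1 || unlockpt(master) == -1) {
        close(master);
        return std::nullopt;
    }
    const char * name = ptsname(master);
    if (!name) { close(master); return std::nullopt; }
    std::string slaveName = name;

    pid_t pid = fork();
    if (pid == -1) { close(master); return std::nullopt; }

    if (pid == 0) {
        // Child: open the slave here, so the parent never holds (or has to
        // close) a duplicate of the slave fd.
        close(master);
        int slave = open(slaveName.c_str(), O_RDWR | O_NOCTTY);
        if (slave == -1) _exit(1);

        // Raw mode, so "\n" is not rewritten to "\r\n" on the way out.
        termios term;
        if (tcgetattr(slave, &term) == -1) _exit(1);
        cfmakeraw(&term);
        if (tcsetattr(slave, TCSANOW, &term) == -1) _exit(1);

        // Placeholder for the "sandbox initialized" message; the exact
        // bytes are an assumption.
        const char msg[] = "sandbox initialized\n";
        if (write(slave, msg, sizeof msg - 1) == -1) _exit(1);
        _exit(0); // exit right away, like a very fast builder
    }

    // Parent: read one line from the master. The poll timeout (an arbitrary
    // 5 s) keeps this sketch from hanging if the child never opens the slave.
    std::string line;
    pollfd pfd{master, POLLIN, 0};
    while (poll(&pfd, 1, 5000) > 0) {
        char c;
        if (read(master, &c, 1) != 1) break; // EOF, or an error such as EIO
        if (c == '\n') break;
        line += c;
    }

    int status = 0;
    waitpid(pid, &status, 0);
    close(master);
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0) return std::nullopt;
    return line;
}
```

Calling `handshakeViaChildOpenedSlave()` on a system with a working pty setup should return the message text without the trailing newline. The poll timeout is only a guard for this sketch and says nothing about how Nix itself handles a child that dies before opening the slave.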

(cherry picked from commit c536e00)
@roberth
Member

roberth commented Nov 30, 2023

Doesn't work yet

IIRC this took multiple attempts to fix, so maybe we need more and/or different commits.
