x/build: gotip-windows-arm64 builders stops working occasionally #66962

Open
cherrymui opened this issue Apr 22, 2024 · 10 comments
Labels
Builders x/build issues (builders, bots, dashboards) NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@cherrymui
Member

cherrymui commented Apr 22, 2024

#!watchflakes
default <- builder ~ `windows-arm64` && log ~ `fatal error: out of memory`

https://ci.chromium.org/ui/p/golang/builders/luci.golang.ci/gotip-windows-arm64
It seems all builders are offline.

cc @golang/release @thanm

@gopherbot gopherbot added the Builders x/build issues (builders, bots, dashboards) label Apr 22, 2024
@gopherbot gopherbot added this to the Unreleased milestone Apr 22, 2024
@cherrymui cherrymui added NeedsFix The path to resolution is known, but the work has not been done. Builders x/build issues (builders, bots, dashboards) and removed Builders x/build issues (builders, bots, dashboards) labels Apr 22, 2024
@thanm
Contributor

thanm commented Apr 23, 2024

I got an access grant and logged into the VMs to inspect them. Both were up (not hung or dead), but the "swarming" user was completely inactive (which is not supposed to happen if the systems are healthy). I inspected the system event logs but didn't see any red flags: the last entry in the logs for anything useful done by swarming is on Apr 7th, and after that the user just vanishes.

From the bot logs I see this in the Apr 7th swarming bot log ("C:\Users\swarming\.swarming\logs\bot_stdout.log.1"):

Found a previous bot, 11832 rebooting as a workaround for https://crbug.com/1061531
Sleeping for 300 secs

We have SWARMING_NEVER_REBOOT set to true for these VMs, but the code in question doesn't seem to respect that.

Of course that doesn't explain why we would have two copies of the swarming bot running at the same time in the first place. Also a mystery as to why we don't get a proper auto-logon of the swarming user after this happens (since when I do manual restarts we don't seem to have this issue). If anyone has any ideas on how to debug this let me know.

I restarted both VMs and they seem to be processing jobs again.
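
A minimal sketch of how the check described in this comment could be automated: scan a bot log for the reboot-workaround message and pull out the PID of the "previous bot". The regular expression is built from the single log line quoted above; the function name `previousBotPIDs` is hypothetical, and the real swarming bot log may word this message differently in other versions.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// rebootWorkaround matches the log line quoted in this thread and captures
// the PID of the previous bot. The pattern is an assumption derived from
// that one sample line.
var rebootWorkaround = regexp.MustCompile(`Found a previous bot, (\d+) rebooting as a workaround`)

// previousBotPIDs returns the PID from every reboot-workaround line in log.
func previousBotPIDs(log string) []int {
	var pids []int
	sc := bufio.NewScanner(strings.NewReader(log))
	for sc.Scan() {
		m := rebootWorkaround.FindStringSubmatch(sc.Text())
		if m == nil {
			continue // not a reboot-workaround line
		}
		pid, err := strconv.Atoi(m[1])
		if err != nil {
			continue // digits too large for an int; skip the line
		}
		pids = append(pids, pid)
	}
	return pids
}

func main() {
	sample := "Found a previous bot, 11832 rebooting as a workaround for https://crbug.com/1061531\n" +
		"Sleeping for 300 secs\n"
	fmt.Println(previousBotPIDs(sample))
}
```

Running something like this over each rotated `bot_stdout.log.*` file would show how often the bot takes this reboot path despite SWARMING_NEVER_REBOOT being set.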

@dmitshur
Contributor

From what I can tell, SWARMING_NEVER_REBOOT takes effect for the most frequent reasons that would otherwise cause a reboot, but it doesn't catch all of them. The swarming bot seems to occasionally trigger a reboot in some edge cases.

We can try to catch and report those edge cases, and aim to get them fixed so the variable does as its name implies in all situations. There may still be future instances that get missed and a restart happens unintentionally anyway.

Other options include making this builder come back automatically after a restart, i.e., remove the need for setting the variable, and just handling the occasional restart manually when it happens.

Since the builders are now back online and working, let's close this particular issue. Thanks.

@cherrymui
Member Author

It seems the builder has stopped working again since earlier this week: https://ci.chromium.org/ui/p/golang/builders/luci.golang.ci/gotip-windows-arm64
Thanks.

@cherrymui cherrymui reopened this May 10, 2024
@thanm
Contributor

thanm commented May 10, 2024

I'll take a look. Wish I could figure out how to make this builder a bit more bulletproof.

@thanm
Contributor

thanm commented May 10, 2024

OK, VMs restarted again. VMs were in the same state as last time, i.e. the

Found a previous bot, 11832 rebooting as a workaround for https://crbug.com/1061531

problem.

@thanm
Contributor

thanm commented Jun 6, 2024

Still working on trying to make our LUCI windows-arm64 builders more reliable.

The latest set of problems here seems to relate to system oversubscription. I restart the VMs, and they run for a few days or a week, then at a certain point jobs launched to them wind up failing early with "out of memory" errors.

Sometimes the problems are in the LUCI infrastructure (ex: cas_download), e.g.

runtime: VirtualAlloc of 3710976 bytes failed with errno=1455
fatal error: out of memory
... 
runtime.(*mheap).alloc(0x38a000?, 0x1c5?, 0x58?)
    runtime/mheap.go:958 +0x54 fp=0x40037cca10 sp=0x40037cc9c0 pc=0x7ff6df88cad4

and sometimes the out of memory errors happen during test build:

    go_test.go:2588: go [build x] failed unexpectedly in C:\Users\swarming\.swarming\w\ir\x\w\goroot\src\cmd\go: fork/exec C:\Users\swarming\.swarming\w\ir\x\t\cmd-go-test-3753799972\tmpdir1058398312\testbin\go.exe: The paging file is too small for this operation to complete.

or this during a test run:

        fatal error: out of memory allocating heap arena map
        
        runtime stack:
        runtime.throw({0x7ff622467db4?, 0x0?})
            C:/Users/swarming/.swarming/w/ir/x/w/goroot/src/runtime/panic.go:1026 +0x38 fp=0x4e8f3ff4a0 sp=0x4e8f3ff470 pc=0x7ff621d45808
        runtime.(*mheap).sysAlloc(0x7ff6229de540, 0x101f?, 0x7ff6229ee980, 0x1)
            C:/Users/swarming/.swarming/w/ir/x/w/goroot/src/runtime/malloc.go:757 +0x348 fp=0x4e8f3ff560 sp=0x4e8f3ff4a0 pc=0x7ff621cdf6f8

I am not sure what could have changed with the VMs to start triggering these sorts of issues. The swarming account logs seem to be clean for the most part, and I don't see anything odd in the system event logs.

When I log into the builders and examine them, there are no tests running, but the system commit charge is at or near 100%. Pictures from process explorer:

[image: azscreenshot]
[image: splitout]

Paging @golang/windows experts -- if anyone has debugged these sorts of issues before and might have ideas on how to proceed, let me know (I am certainly out of ideas). My gut is that there is some sort of zombie process here, but given that I can't see any processes active from LUCI, I'm not sure how to debug this.

@alexbrainman
Member

Sometimes the problems are in the LUCI infrastructure (ex: cas_download), e.g.

runtime: VirtualAlloc of 3710976 bytes failed with errno=1455
fatal error: out of memory
... 
runtime.(*mheap).alloc(0x38a000?, 0x1c5?, 0x58?)
    runtime/mheap.go:958 +0x54 fp=0x40037cca10 sp=0x40037cc9c0 pc=0x7ff6df88cad4

and sometimes the out of memory errors happen during test build:

    go_test.go:2588: go [build x] failed unexpectedly in C:\Users\swarming\.swarming\w\ir\x\w\goroot\src\cmd\go: fork/exec C:\Users\swarming\.swarming\w\ir\x\t\cmd-go-test-3753799972\tmpdir1058398312\testbin\go.exe: The paging file is too small for this operation to complete.

errno 1455 is ERROR_COMMITMENT_LIMIT, i.e. the "The paging file is too small for this operation to complete." error, so both your LUCI infra and tests encounter the same error.

So I suspect your page file is too small or similar.
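
To make that equivalence concrete, here is a small sketch that treats both log signatures as the same underlying failure. The value 1455 for ERROR_COMMITMENT_LIMIT is the documented Windows error code; the helper name `hitCommitLimit` and the choice of substrings are assumptions for illustration, taken only from the log excerpts quoted in this thread.

```go
package main

import (
	"fmt"
	"strings"
)

// errorCommitmentLimit is the Windows error code ERROR_COMMITMENT_LIMIT.
const errorCommitmentLimit = 1455

// commitLimitSignatures lists the two ways the error shows up in the logs
// quoted above: as a raw errno from the Go runtime, and as the Windows
// message text from a failed fork/exec.
var commitLimitSignatures = []string{
	fmt.Sprintf("errno=%d", errorCommitmentLimit),
	"The paging file is too small for this operation to complete.",
}

// hitCommitLimit reports whether log contains either signature.
// It is a plain substring check, so it is only a rough classifier.
func hitCommitLimit(log string) bool {
	for _, sig := range commitLimitSignatures {
		if strings.Contains(log, sig) {
			return true
		}
	}
	return false
}

func main() {
	runtimeLog := "runtime: VirtualAlloc of 3710976 bytes failed with errno=1455\nfatal error: out of memory"
	execLog := "fork/exec go.exe: The paging file is too small for this operation to complete."
	fmt.Println(hitCommitLimit(runtimeLog), hitCommitLimit(execLog), hitCommitLimit("PASS"))
}
```

The third trace in the original report ("out of memory allocating heap arena map") shows no errno in the quoted excerpt, so this sketch deliberately does not classify it.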

I am not an expert in this area anymore. I even doubt the page file still exists on modern Windows. But I agree with you that a Current Commit Charge of 100% in the System Information program looks bad for your workload. Maybe the 16 GB of memory needs to be increased. Maybe there are different ways to add virtual memory.

I googled for "page file windows task manager", and I found

https://learn.microsoft.com/en-us/troubleshoot/windows-client/performance/introduction-to-the-page-file

but I cannot find any good suggestions there.

Hopefully other Windows experts will help.

Alex

@gdams

gdams commented Jun 18, 2024

Hi there 👋🏼 I'm the Go group manager at Microsoft and I'd like to set up a call to discuss improving the stability of these builders. Who should I include? Thanks!

@thanm
Contributor

thanm commented Jun 24, 2024

Hello @gdams, on the Go core team side, if you could please include myself (@thanm) and Dmitri (@dmitshur), that would be a good start. Thanks, Than

@dmitshur
Contributor

dmitshur commented Oct 3, 2024

Recording here that there was another instance of the builders disconnecting and needing to be restarted around September 20-23:

[image]

Purple boxes are where it was missing until being restarted. The 3 failed builds before that all failed with "cannot allocate memory" errors.

Since the restart it's been working okay again. That suggests the work to have the builder fully start up after a restart is complete and working, so it could work well to stop setting SWARMING_NEVER_REBOOT and allow LUCI to restart the builders when it's deemed necessary.

@dmitshur dmitshur changed the title x/build: gotip-windows-arm64 builders stops working x/build: gotip-windows-arm64 builders stops working occasionally Oct 3, 2024