amdgpu and drm drivers version #150

Open
kotee4ko opened this issue Oct 21, 2023 · 10 comments

@kotee4ko

kotee4ko commented Oct 21, 2023

Hello.
Thanks for your work.

May I ask how to ensure that I'm running the latest GPU driver?
The question may be a bit strange, but in kmsg I see that amdgpu is dated 2016.
The kernel from this repo was configured with olddefconfig and compiled without key signing.
Modules were built and installed.
The initramfs was packed, grub was updated, and the system (a remote server) booted successfully.
uname -a says that it is running a custom 6.2.8.

I'm using a Radeon V340 Pro: twin GPU, 16 GB of VRAM per core, gfx900, Vega 10.
My ultimate goal is to allow cross-device VRAM access.

At the moment, I get a crash somewhere in libhipamd64.so after entering hipMemcpy2DAsync from userspace.

I spent ~20 hours trying to build hipamd, which has been moved several times and is now part of crt. But the lack of up-to-date documentation and my limited experience with HIP in general left me stuck.

Can you please give me a hint on what I should do to:

  1. ensure that I am running the latest driver and firmware for the GPU.
  2. ensure that the kernel is configured to allow fast VRAM DMA and cross-GPU DMA.
  3. maybe any other hint on how to deal with crashes in libhipamd64, or a less painful way to build the whole SDK.
  4. under rocgdb I can see the src and dst memory, and even change it. But the target code (not rocgdb itself) crashes because of a nullptr dereference somewhere deep inside the hipMemcpy2DAsync call.

Btw, as for attempts to build from dkms: I tried building on a fresh Ubuntu installation, on Debian, and on this kernel too. None of them built successfully from dkms.

Best regards,

@kotee4ko
Author

@kentrussell, Sir?

@kentrussell
Contributor

If you're using the ROCm install, you can run "dkms status" to see the version of the kernel installation package. If you're using the stock Ubuntu kernel, then that will depend on the version of Ubuntu that you're using. Alternatively, "dpkg -l|grep amdgpu-dkms" will also show the version of the kernel+firmware.
If your error isn't ending up in dmesg, but is isolated to the userspace interactions around hipMemcpy2DAsync, then the HIP guys might be able to help figure that out (especially if rocgdb is reporting a null pointer, which is definitely not good). If dmesg is printing anything around the crash, then please attach that log, since it will hopefully give some useful information to go on.

Lastly, what problems did you have with building from dkms? I've currently got the ROCm kernel installed on Ubuntu 22.04.1 (6.2 kernel) via dkms without any issue.
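
For scripting the "dkms status" check on a remote machine, a minimal Python sketch could look like the following. The helper names `parse_dkms_status` and `installed_amdgpu_dkms` are made up for illustration, and the two line formats handled here (`amdgpu/<version>, <kernel>, <arch>: <state>` and the older comma-only form) are an assumption about what different dkms releases print, so the output should be checked against a real run.

```python
import re
import subprocess


def parse_dkms_status(text, module="amdgpu"):
    """Return (version, kernel, state) tuples for `module` found in `dkms status` text."""
    entries = []
    for line in text.splitlines():
        if ":" not in line:
            continue
        head, _, state = line.partition(":")
        # Split on "/" or "," so that both "amdgpu/1.2, k, arch" and
        # "amdgpu, 1.2, k, arch" (assumed older format) are accepted.
        fields = [f.strip() for f in re.split(r"[/,]", head) if f.strip()]
        if len(fields) >= 3 and fields[0] == module:
            entries.append((fields[1], fields[2], state.strip()))
    return entries


def installed_amdgpu_dkms():
    """Run `dkms status` and parse it; returns [] when dkms is not available."""
    try:
        result = subprocess.run(
            ["dkms", "status"], capture_output=True, text=True, check=False
        )
    except OSError:
        return []
    return parse_dkms_status(result.stdout)
```

Calling `installed_amdgpu_dkms()` on the affected host would list each amdgpu-dkms version together with the kernel it was built for, which makes a mismatch between the running kernel and the built module easy to spot.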

@kotee4ko
Author

Thanks.

As for dkms: it won't build if the current kernel is the ROCK kernel.
On the stock Ubuntu kernel I was able to build and load the latest driver (tainting the kernel).

As for hipMemcpy2DAsync: it seems that there is some difference between CUDA and HIP behaviour when processing P2P memory.

For example, when the active GPU is 1, an attempt to copy or memset memory allocated on GPU 0 crashes. At least on gfx900.

Moreover, it is quite unclear how to enable P2P, and whether it is possible at all.
On the LKM side, the only option I found is setting vm_mode to 3 to force both access types; even then isPeerCanAccess(1,0) returns false.
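
To see at a glance which device pairs report peer access, a small sketch along these lines may help. `peer_access_matrix` and `format_matrix` are hypothetical helpers, and the query itself is passed in as a callable so the snippet does not depend on any particular runtime. On a ROCm build of PyTorch, `torch.cuda.device_count()` and `torch.cuda.can_device_access_peer` could plausibly be supplied, but that pairing is an assumption and was not tested on gfx900.

```python
def peer_access_matrix(device_count, can_access_peer):
    """matrix[i][j] is True when device i reports that it can access device j."""
    return [
        [i != j and bool(can_access_peer(i, j)) for j in range(device_count)]
        for i in range(device_count)
    ]


def format_matrix(matrix):
    """Render the matrix with one row per device: Y = access, N = none, - = self."""
    lines = []
    for i, row in enumerate(matrix):
        cells = " ".join(
            "-" if i == j else ("Y" if ok else "N") for j, ok in enumerate(row)
        )
        lines.append(f"GPU{i}: {cells}")
    return "\n".join(lines)
```

For the two-GPU case described above, a result of `GPU0: - N` and `GPU1: N -` would confirm that the runtime reports no peer access in either direction.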

Currently, code that works for CUDA users (well, at least code that doesn't trigger SIGSEGV) triggers it on gfx900.

ggerganov/ggml#590

If this subject is drifting away from amdgpu, feel free to close the issue.
Thank you again.

@kentrussell
Contributor

amdgpu-dkms would have all of the P2P features in the kernel enabled, as would the monolithic kernel built from this repo. The stock Ubuntu OS kernel wouldn't have IPC or RDMA support, but for your situation, straight P2P (1 host, 2 GPUs) should be working in the kernel without issue.
I'd raise the issue in the HIP repo at https://github.com/ROCm-Developer-Tools/HIP, since I'm less familiar with that code and they should be able to help you figure out the differences in hipMemcpy2DAsync, as well as helping to isolate the source of the crash. If there's anything useful in dmesg, I can take a look and see if there's anything obvious there, but from my initial pass, it looks like this might be a HIP bug and not a kernel bug. Regardless, I'll leave this open until I can check your dmesg (from after the crash) to see if there's anything that stands out to me there.
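
For a self-built kernel, one way to double-check the P2P-related build options is to scan the kernel config file. The sketch below is only illustrative: the option names in `P2P_OPTIONS` are an assumption based on upstream Kconfig and are not taken from this thread, and the location of the config file (commonly `/boot/config-<release>`) varies by distribution.

```python
# Option names are assumptions from upstream Kconfig, not confirmed in this thread.
P2P_OPTIONS = (
    "CONFIG_HSA_AMD",
    "CONFIG_HSA_AMD_P2P",
    "CONFIG_PCI_P2PDMA",
    "CONFIG_DMABUF_MOVE_NOTIFY",
)


def parse_kernel_config(text):
    """Map CONFIG_* names to their values; '# CONFIG_X is not set' maps to 'n'."""
    suffix = " is not set"
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(suffix):
            values[line[2:-len(suffix)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, _, value = line.partition("=")
            values[name] = value
    return values


def p2p_report(text, options=P2P_OPTIONS):
    """Return the value of each option of interest, or 'missing' if it is absent."""
    config = parse_kernel_config(text)
    return {name: config.get(name, "missing") for name in options}
```

Feeding the contents of the running kernel's config file to `p2p_report` shows which of these options are built in (`y`), modular (`m`), disabled (`n`), or unknown to that kernel version (`missing`).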

@kotee4ko
Author

The only thing I see in dmesg is a WARN_ONCE() that reasonably warns me about large BAR systems.


Oct 24 04:39:18 ai-dev kernel: ------------[ cut here ]------------
Oct 24 04:39:18 ai-dev kernel: CPU update of VM recommended only for large BAR system
Oct 24 04:39:18 ai-dev kernel: WARNING: CPU: 16 PID: 2250 at /var/lib/dkms/amdgpu/6.2.4-1664922.22.04/build/amd/amdgpu/amdgpu_vm.c:2316 amdgpu_vm_make_compute+0x113/0x2d0 [amdgpu]
Oct 24 04:39:18 ai-dev kernel: Modules linked in: intel_rapl_msr intel_rapl_common sb_edac ipmi_ssif binfmt_misc x86_pkg_temp_thermal intel_powerclamp coretemp nls_iso8859_1 kvm_intel kvm irqbypass crct10dif_pclmul polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 aesni_intel crypto_simd cryptd rapl intel_cstate joydev input_leds mgag200 hpilo drm_shmem_helper ioatdma dca acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_tad mac_hid acpi_power_meter sch_fq_codel msr parport_pc ppdev lp parport efi_pstore ip_tables x_tables autofs4 ib_uverbs ib_core amdgpu(OE) amddrm_ttm_helper(OE) amdttm(OE) amdxcp(OE) iommu_v2 amddrm_buddy(OE) drm_display_helper drm_kms_helper syscopyarea sysfillrect sysimgblt cec rc_core amd_sched(OE) amdkcl(OE) hid_generic drm usbhid hid i2c_algo_bit crc32_pclmul video nvme i2c_i801 tg3 i2c_smbus lpc_ich hpsa nvme_core xhci_pci xhci_pci_renesas nvme_common scsi_transport_sas wmi
Oct 24 04:39:18 ai-dev kernel: CPU: 16 PID: 2250 Comm: hiptest Tainted: G        W  OE      6.2.0-35-generic #35~22.04.1-Ubuntu
Oct 24 04:39:18 ai-dev kernel: Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 07/18/2022
Oct 24 04:39:18 ai-dev kernel: RIP: 0010:amdgpu_vm_make_compute+0x113/0x2d0 [amdgpu]
Oct 24 04:39:18 ai-dev kernel: Code: 0f b6 35 36 19 c3 00 41 80 fe 01 0f 87 c1 01 00 00 41 83 e6 01 75 15 48 c7 c7 00 f1 26 c1 c6 05 18 19 c3 00 01 e8 1d 69 d5 f6 <0f> 0b 0f b6 83 f9 02 00 00 41 89 c6 41 83 e6 01 3c 01 0f 87 03 ea
Oct 24 04:39:18 ai-dev kernel: RSP: 0018:ffffb186a575fca0 EFLAGS: 00010246
Oct 24 04:39:18 ai-dev kernel: RAX: 0000000000000000 RBX: ffff9e8b90ae8000 RCX: 0000000000000000
Oct 24 04:39:18 ai-dev kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
Oct 24 04:39:18 ai-dev kernel: RBP: ffffb186a575fcd0 R08: 0000000000000000 R09: 0000000000000000
Oct 24 04:39:18 ai-dev kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
Oct 24 04:39:18 ai-dev kernel: R13: ffff9e9b0be00000 R14: 0000000000000000 R15: 0000000000000000
Oct 24 04:39:18 ai-dev kernel: FS:  00007ff881a55a80(0000) GS:ffff9e9abfc00000(0000) knlGS:0000000000000000
Oct 24 04:39:18 ai-dev kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Oct 24 04:39:18 ai-dev kernel: CR2: 00000000013a8ef8 CR3: 0000000108776004 CR4: 00000000001706e0
Oct 24 04:39:18 ai-dev kernel: Call Trace:
Oct 24 04:39:18 ai-dev kernel:  <TASK>
Oct 24 04:39:18 ai-dev kernel:  ? show_regs+0x72/0x90
Oct 24 04:39:18 ai-dev kernel:  ? amdgpu_vm_make_compute+0x113/0x2d0 [amdgpu]
Oct 24 04:39:18 ai-dev kernel:  ? __warn+0x8d/0x160
Oct 24 04:39:18 ai-dev kernel:  ? amdgpu_vm_make_compute+0x113/0x2d0 [amdgpu]
Oct 24 04:39:18 ai-dev kernel:  ? report_bug+0x1bb/0x1d0
Oct 24 04:39:18 ai-dev kernel:  ? handle_bug+0x46/0x90
Oct 24 04:39:18 ai-dev kernel:  ? exc_invalid_op+0x19/0x80
Oct 24 04:39:18 ai-dev kernel:  ? asm_exc_invalid_op+0x1b/0x20
Oct 24 04:39:18 ai-dev kernel:  ? amdgpu_vm_make_compute+0x113/0x2d0 [amdgpu]
Oct 24 04:39:18 ai-dev kernel:  amdgpu_amdkfd_gpuvm_acquire_process_vm+0x35/0x550 [amdgpu]
Oct 24 04:39:18 ai-dev kernel:  kfd_process_device_init_vm+0xbb/0x320 [amdgpu]
Oct 24 04:39:18 ai-dev kernel:  kfd_ioctl_acquire_vm+0x91/0xd0 [amdgpu]
Oct 24 04:39:18 ai-dev kernel:  kfd_ioctl+0x3ac/0x500 [amdgpu]
Oct 24 04:39:18 ai-dev kernel:  ? __pfx_kfd_ioctl_acquire_vm+0x10/0x10 [amdgpu]
Oct 24 04:39:18 ai-dev kernel:  ? call_rcu+0xe/0x20
Oct 24 04:39:18 ai-dev kernel:  ? __fput+0x12b/0x290
Oct 24 04:39:18 ai-dev kernel:  __x64_sys_ioctl+0x9d/0xe0
Oct 24 04:39:18 ai-dev kernel:  do_syscall_64+0x5c/0x90
Oct 24 04:39:18 ai-dev kernel:  ? do_syscall_64+0x69/0x90
Oct 24 04:39:18 ai-dev kernel:  ? do_syscall_64+0x69/0x90
Oct 24 04:39:18 ai-dev kernel:  ? do_syscall_64+0x69/0x90
Oct 24 04:39:18 ai-dev kernel:  ? syscall_exit_to_user_mode+0x38/0x60
Oct 24 04:39:18 ai-dev kernel:  ? do_syscall_64+0x69/0x90
Oct 24 04:39:18 ai-dev kernel:  ? do_syscall_64+0x69/0x90
Oct 24 04:39:18 ai-dev kernel:  entry_SYSCALL_64_after_hwframe+0x73/0xdd
Oct 24 04:39:18 ai-dev kernel: RIP: 0033:0x7ff88151ab3f
Oct 24 04:39:18 ai-dev kernel: Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <41> 89 c0 3d 00 f0 ff ff 77 1f 48 8b 44 24 18 64 48 2b 04 25 28 00
Oct 24 04:39:18 ai-dev kernel: RSP: 002b:00007ffe76dcd040 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Oct 24 04:39:18 ai-dev kernel: RAX: ffffffffffffffda RBX: 00007ffe76dcd210 RCX: 00007ff88151ab3f
Oct 24 04:39:18 ai-dev kernel: RDX: 00007ffe76dcd210 RSI: 0000000040084b15 RDI: 0000000000000003
Oct 24 04:39:18 ai-dev kernel: RBP: 0000000040084b15 R08: 0000000000000007 R09: 0000000000000002
Oct 24 04:39:18 ai-dev kernel: R10: 000000000135ef40 R11: 0000000000000246 R12: 00000000013609d0
Oct 24 04:39:18 ai-dev kernel: R13: 0000000000000003 R14: 00007ff8788bd040 R15: 00000000000000c0
Oct 24 04:39:18 ai-dev kernel:  </TASK>
Oct 24 04:39:18 ai-dev kernel: ---[ end trace 0000000000000000 ]---
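
The userspace half of the trace above also identifies the ioctl that was in flight: `ORIG_RAX` is 0x10 (the ioctl syscall on x86-64) and `RSI`, the request argument, is 0x40084b15. A small sketch of decoding that request number using the generic Linux `_IOC` bit layout follows. `decode_ioctl` is a name made up for illustration, and the layout applies to x86-64, not to every architecture.

```python
def decode_ioctl(request):
    """Split a Linux ioctl request number into its _IOC fields (generic layout)."""
    directions = {0: "none", 1: "write", 2: "read", 3: "read/write"}
    return {
        "nr": request & 0xFF,                 # bits 0-7: command number
        "type": chr((request >> 8) & 0xFF),   # bits 8-15: magic character
        "size": (request >> 16) & 0x3FFF,     # bits 16-29: argument size in bytes
        "dir": directions[(request >> 30) & 0x3],  # bits 30-31: transfer direction
    }
```

For 0x40084b15 this gives command 0x15 with magic `K`, an 8-byte argument, and write direction, which is consistent with the `kfd_ioctl_acquire_vm` frame in the kernel half of the trace.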

I opened an issue in HIP with a PoC and references.
For now I think this issue can be closed.
Thanks!

@kotee4ko
Author

Okay, Sir, dmesg logs:

descr: attempt at parallel task execution on 2 GPUs, using hiptorch (latest)
cmd: torchrun --standalone --nproc_per_node=2 train.py
train.py from: github.com/rogerallen/llama2.cu.git
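
When reading the page-fault records in the log below, the raw `VM_L2_PROTECTION_FAULT_STATUS` value can be unpacked by hand. The following is a minimal sketch, not the driver's code: the bit positions in `FIELDS` are an assumption taken from gfx9-era register definitions, and `decode_fault_status` is a hypothetical helper name.

```python
# (shift, width) per field; layout assumed from gfx9-era register headers.
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),
    "RW": (18, 1),
    "VMID": (20, 4),
}


def decode_fault_status(status):
    """Extract each bit field from a VM_L2_PROTECTION_FAULT_STATUS value."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }
```

Applied to the first record's value, 0x00800031, this yields MORE_FAULTS 1, PERMISSION_FAULTS 3, RW 0 and VMID 8, matching the fields the driver printed for that record and the `vmid:8` in its header line. The later records carry a status of 0x00000000, so every field decodes to zero.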


[278972.676901] gmc_v9_0_process_interrupt: 1 callbacks suppressed
[278972.676912] amdgpu 0000:88:00.0: amdgpu: [mmhub0] no-retry page fault (src_id:0 ring:24 vmid:8 pasid:32768, for process ipython3 pid 827117 thread ipython3 pid 827117)
[278972.679479] amdgpu 0000:88:00.0: amdgpu:   in page starting at address 0x00007efaefbff000 from IH client 0x12 (VMC)
[278972.680744] amdgpu 0000:88:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00800031
[278972.682010] amdgpu 0000:88:00.0: amdgpu:     Faulty UTCL2 client ID: MP0 (0x0)
[278972.683228] amdgpu 0000:88:00.0: amdgpu:     MORE_FAULTS: 0x1
[278972.684404] amdgpu 0000:88:00.0: amdgpu:     WALKER_ERROR: 0x0
[278972.685582] amdgpu 0000:88:00.0: amdgpu:     PERMISSION_FAULTS: 0x3
[278972.686763] amdgpu 0000:88:00.0: amdgpu:     MAPPING_ERROR: 0x0
[278972.687921] amdgpu 0000:88:00.0: amdgpu:     RW: 0x0
[278972.689097] amdgpu 0000:88:00.0: amdgpu: [mmhub0] no-retry page fault (src_id:0 ring:24 vmid:8 pasid:32768, for process ipython3 pid 827117 thread ipython3 pid 827117)
[278972.691535] amdgpu 0000:88:00.0: amdgpu:   in page starting at address 0x00007efaefbf9000 from IH client 0x12 (VMC)
[278972.692794] amdgpu 0000:88:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[278972.694052] amdgpu 0000:88:00.0: amdgpu:     Faulty UTCL2 client ID: MP0 (0x0)
[278972.695273] amdgpu 0000:88:00.0: amdgpu:     MORE_FAULTS: 0x0
[278972.696450] amdgpu 0000:88:00.0: amdgpu:     WALKER_ERROR: 0x0
[278972.697629] amdgpu 0000:88:00.0: amdgpu:     PERMISSION_FAULTS: 0x0
[278972.698831] amdgpu 0000:88:00.0: amdgpu:     MAPPING_ERROR: 0x0
[278972.699603] amdgpu 0000:88:00.0: amdgpu:     RW: 0x0
[278972.700089] amdgpu 0000:88:00.0: amdgpu: [mmhub0] no-retry page fault (src_id:0 ring:24 vmid:8 pasid:32768, for process ipython3 pid 827117 thread ipython3 pid 827117)
[278972.702337] amdgpu 0000:88:00.0: amdgpu:   in page starting at address 0x00007efaefbfa000 from IH client 0x12 (VMC)
[278972.703619] amdgpu 0000:88:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[278972.704856] amdgpu 0000:88:00.0: amdgpu:     Faulty UTCL2 client ID: MP0 (0x0)
[278972.706076] amdgpu 0000:88:00.0: amdgpu:     MORE_FAULTS: 0x0
[278972.707276] amdgpu 0000:88:00.0: amdgpu:     WALKER_ERROR: 0x0
[278972.708443] amdgpu 0000:88:00.0: amdgpu:     PERMISSION_FAULTS: 0x0
[278972.709619] amdgpu 0000:88:00.0: amdgpu:     MAPPING_ERROR: 0x0
[278972.710797] amdgpu 0000:88:00.0: amdgpu:     RW: 0x0
[278972.711965] amdgpu 0000:88:00.0: amdgpu: [mmhub0] no-retry page fault (src_id:0 ring:24 vmid:8 pasid:32768, for process ipython3 pid 827117 thread ipython3 pid 827117)
[278972.714393] amdgpu 0000:88:00.0: amdgpu:   in page starting at address 0x00007efaefbfb000 from IH client 0x12 (VMC)
[278972.715662] amdgpu 0000:88:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[278972.716895] amdgpu 0000:88:00.0: amdgpu:     Faulty UTCL2 client ID: MP0 (0x0)
[278972.718109] amdgpu 0000:88:00.0: amdgpu:     MORE_FAULTS: 0x0
[278972.719229] amdgpu 0000:88:00.0: amdgpu:     WALKER_ERROR: 0x0
[278972.720310] amdgpu 0000:88:00.0: amdgpu:     PERMISSION_FAULTS: 0x0
[278972.721400] amdgpu 0000:88:00.0: amdgpu:     MAPPING_ERROR: 0x0
[278972.722426] amdgpu 0000:88:00.0: amdgpu:     RW: 0x0
[278972.723431] amdgpu 0000:88:00.0: amdgpu: [mmhub0] no-retry page fault (src_id:0 ring:24 vmid:8 pasid:32768, for process ipython3 pid 827117 thread ipython3 pid 827117)
[278972.725526] amdgpu 0000:88:00.0: amdgpu:   in page starting at address 0x00007efaefbfc000 from IH client 0x12 (VMC)
[278972.726576] amdgpu 0000:88:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[278972.727564] amdgpu 0000:88:00.0: amdgpu:     Faulty UTCL2 client ID: MP0 (0x0)
[278972.728526] amdgpu 0000:88:00.0: amdgpu:     MORE_FAULTS: 0x0
[278972.729488] amdgpu 0000:88:00.0: amdgpu:     WALKER_ERROR: 0x0
[278972.730398] amdgpu 0000:88:00.0: amdgpu:     PERMISSION_FAULTS: 0x0
[278972.731270] amdgpu 0000:88:00.0: amdgpu:     MAPPING_ERROR: 0x0
[278972.732136] amdgpu 0000:88:00.0: amdgpu:     RW: 0x0
[278972.733025] amdgpu 0000:88:00.0: amdgpu: [mmhub0] no-retry page fault (src_id:0 ring:24 vmid:8 pasid:32768, for process ipython3 pid 827117 thread ipython3 pid 827117)
[278972.734784] amdgpu 0000:88:00.0: amdgpu:   in page starting at address 0x00007efaefbfd000 from IH client 0x12 (VMC)
[278972.735674] amdgpu 0000:88:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[278972.736543] amdgpu 0000:88:00.0: amdgpu:     Faulty UTCL2 client ID: MP0 (0x0)
[278972.737402] amdgpu 0000:88:00.0: amdgpu:     MORE_FAULTS: 0x0
[278972.738242] amdgpu 0000:88:00.0: amdgpu:     WALKER_ERROR: 0x0
[278972.739016] amdgpu 0000:88:00.0: amdgpu:     PERMISSION_FAULTS: 0x0
[278972.739788] amdgpu 0000:88:00.0: amdgpu:     MAPPING_ERROR: 0x0
[278972.740560] amdgpu 0000:88:00.0: amdgpu:     RW: 0x0
[278972.741359] amdgpu 0000:88:00.0: amdgpu: [mmhub0] no-retry page fault (src_id:0 ring:24 vmid:8 pasid:32768, for process ipython3 pid 827117 thread ipython3 pid 827117)
[278972.742922] amdgpu 0000:88:00.0: amdgpu:   in page starting at address 0x00007efaefbfe000 from IH client 0x12 (VMC)
[278972.743715] amdgpu 0000:88:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[278972.744491] amdgpu 0000:88:00.0: amdgpu:     Faulty UTCL2 client ID: MP0 (0x0)
[278972.745268] amdgpu 0000:88:00.0: amdgpu:     MORE_FAULTS: 0x0
[278972.745998] amdgpu 0000:88:00.0: amdgpu:     WALKER_ERROR: 0x0
[278972.746693] amdgpu 0000:88:00.0: amdgpu:     PERMISSION_FAULTS: 0x0
[278972.747387] amdgpu 0000:88:00.0: amdgpu:     MAPPING_ERROR: 0x0
[278972.748080] amdgpu 0000:88:00.0: amdgpu:     RW: 0x0
[278972.748781] amdgpu 0000:88:00.0: amdgpu: [mmhub0] no-retry page fault (src_id:0 ring:24 vmid:8 pasid:32768, for process ipython3 pid 827117 thread ipython3 pid 827117)
[278972.750255] amdgpu 0000:88:00.0: amdgpu:   in page starting at address 0x00007efaefbff000 from IH client 0x12 (VMC)
[278972.751020] amdgpu 0000:88:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[278972.751727] amdgpu 0000:88:00.0: amdgpu:     Faulty UTCL2 client ID: MP0 (0x0)
[278972.752417] amdgpu 0000:88:00.0: amdgpu:     MORE_FAULTS: 0x0
[278972.753096] amdgpu 0000:88:00.0: amdgpu:     WALKER_ERROR: 0x0
[278972.754076] amdgpu 0000:88:00.0: amdgpu:     PERMISSION_FAULTS: 0x0
[278972.754710] amdgpu 0000:88:00.0: amdgpu:     MAPPING_ERROR: 0x0
[278972.755338] amdgpu 0000:88:00.0: amdgpu:     RW: 0x0
[278972.755974] amdgpu 0000:88:00.0: amdgpu: [mmhub0] no-retry page fault (src_id:0 ring:24 vmid:8 pasid:32768, for process ipython3 pid 827117 thread ipython3 pid 827117)
[278972.757300] amdgpu 0000:88:00.0: amdgpu:   in page starting at address 0x00007efaefc00000 from IH client 0x12 (VMC)
[278972.757994] amdgpu 0000:88:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[278972.758643] amdgpu 0000:88:00.0: amdgpu:     Faulty UTCL2 client ID: MP0 (0x0)
[278972.759267] amdgpu 0000:88:00.0: amdgpu:     MORE_FAULTS: 0x0
[278972.759878] amdgpu 0000:88:00.0: amdgpu:     WALKER_ERROR: 0x0
[278972.760485] amdgpu 0000:88:00.0: amdgpu:     PERMISSION_FAULTS: 0x0
[278972.761106] amdgpu 0000:88:00.0: amdgpu:     MAPPING_ERROR: 0x0
[278972.761712] amdgpu 0000:88:00.0: amdgpu:     RW: 0x0
[278972.762297] amdgpu 0000:88:00.0: amdgpu: [mmhub0] no-retry page fault (src_id:0 ring:24 vmid:8 pasid:32768, for process ipython3 pid 827117 thread ipython3 pid 827117)
[278972.763506] amdgpu 0000:88:00.0: amdgpu:   in page starting at address 0x00007efaefc01000 from IH client 0x12 (VMC)
[278972.764132] amdgpu 0000:88:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[278972.764746] amdgpu 0000:88:00.0: amdgpu:     Faulty UTCL2 client ID: MP0 (0x0)
[278972.765362] amdgpu 0000:88:00.0: amdgpu:     MORE_FAULTS: 0x0
[278972.765934] amdgpu 0000:88:00.0: amdgpu:     WALKER_ERROR: 0x0
[278972.766469] amdgpu 0000:88:00.0: amdgpu:     PERMISSION_FAULTS: 0x0
[278972.767002] amdgpu 0000:88:00.0: amdgpu:     MAPPING_ERROR: 0x0
[278972.767533] amdgpu 0000:88:00.0: amdgpu:     RW: 0x0
[279123.157079] INFO: task ipython3:827117 blocked for more than 120 seconds.
[279123.158109]       Tainted: G        W  OE      6.2.0-35-generic #35~22.04.1-Ubuntu
[279123.159292] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[279123.160483] task:ipython3        state:D stack:0     pid:827117 ppid:742680 flags:0x00000002
[279123.160493] Call Trace:
[279123.160497]  <TASK>
[279123.160504]  __schedule+0x2b7/0x5f0
[279123.160516]  schedule+0x68/0x110
[279123.160521]  do_exit+0xf3/0x6c0
[279123.160532]  do_group_exit+0x35/0x90
[279123.160540]  get_signal+0x8a5/0x8d0
[279123.160547]  ? sysvec_call_function+0x4e/0xb0
[279123.160556]  arch_do_signal_or_restart+0x2a/0x120
[279123.160566]  exit_to_user_mode_loop+0xaf/0x140
[279123.160576]  exit_to_user_mode_prepare+0xb9/0xd0
[279123.160580]  irqentry_exit_to_user_mode+0x9/0x20
[279123.160587]  irqentry_exit+0x43/0x50
[279123.160593]  sysvec_reschedule_ipi+0x7b/0x120
[279123.160599]  asm_sysvec_reschedule_ipi+0x1b/0x20
[279123.160609] RIP: 0033:0x7effa44501e2
[279123.160614] RSP: 002b:00007ffc5215e260 EFLAGS: 00000202
[279123.160619] RAX: 00007effa48a5610 RBX: 0000000000000001 RCX: ffffffffffffffff
[279123.160623] RDX: 0006b45300000000 RSI: 00007effac003e00 RDI: 000055f6f102f410
[279123.160626] RBP: 000055f6f6942f90 R08: 0000000000000001 R09: 00007effa4758f00
[279123.160629] R10: 0000000000007efa R11: 0cb7fa7a47432051 R12: 0000000000000002
[279123.160633] R13: 0000000000000001 R14: 000055f6f6942fb0 R15: 0000000000000001
[279123.160640]  </TASK>
[279123.160643] INFO: task ipython3:827118 blocked for more than 120 seconds.
[279123.161842]       Tainted: G        W  OE      6.2.0-35-generic #35~22.04.1-Ubuntu
[279123.163001] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[279123.164146] task:ipython3        state:D stack:0     pid:827118 ppid:742680 flags:0x00004002
[279123.164154] Call Trace:
[279123.164156]  <TASK>
[279123.164159]  __schedule+0x2b7/0x5f0
[279123.164166]  schedule+0x68/0x110
[279123.164170]  do_exit+0xf3/0x6c0
[279123.164177]  do_group_exit+0x35/0x90
[279123.164185]  get_signal+0x8a5/0x8d0
[279123.164193]  arch_do_signal_or_restart+0x2a/0x120
[279123.164200]  exit_to_user_mode_loop+0xaf/0x140
[279123.164208]  exit_to_user_mode_prepare+0xb9/0xd0
[279123.164212]  syscall_exit_to_user_mode+0x2a/0x60
[279123.164219]  do_syscall_64+0x69/0x90
[279123.164228]  ? do_syscall_64+0x69/0x90
[279123.164235]  ? syscall_exit_to_user_mode+0x38/0x60
[279123.164241]  ? do_syscall_64+0x69/0x90
[279123.164247]  ? do_syscall_64+0x69/0x90
[279123.164253]  entry_SYSCALL_64_after_hwframe+0x73/0xdd
[279123.164259] RIP: 0033:0x7effb1291117
[279123.164263] RSP: 002b:00007effaefcbec0 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
[279123.164268] RAX: fffffffffffffe00 RBX: 00007effa801d2d0 RCX: 00007effb1291117
[279123.164271] RDX: 0000000000000000 RSI: 0000000000000189 RDI: 00007effa801d2d0
[279123.164274] RBP: 0000000000000000 R08: 0000000000000000 R09: 00000000ffffffff
[279123.164276] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[279123.164279] R13: 0000000000000000 R14: 000055f6e98a406c R15: 20c49ba5e353f7cf
[279123.164286]  </TASK>
[279123.164288] INFO: task ipython3:827200 blocked for more than 120 seconds.
[279123.165445]       Tainted: G        W  OE      6.2.0-35-generic #35~22.04.1-Ubuntu
[279123.166599] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[279123.167759] task:ipython3        state:D stack:0     pid:827200 ppid:742680 flags:0x00004002
[279123.167766] Call Trace:
[279123.167768]  <TASK>
[279123.167770]  __schedule+0x2b7/0x5f0
[279123.167777]  schedule+0x68/0x110
[279123.167781]  do_exit+0xf3/0x6c0
[279123.167789]  do_group_exit+0x35/0x90
[279123.167796]  get_signal+0x8a5/0x8d0
[279123.167803]  arch_do_signal_or_restart+0x2a/0x120
[279123.167810]  exit_to_user_mode_loop+0xaf/0x140
[279123.167818]  exit_to_user_mode_prepare+0xb9/0xd0
[279123.167822]  syscall_exit_to_user_mode+0x2a/0x60
[279123.167829]  do_syscall_64+0x69/0x90
[279123.167835]  entry_SYSCALL_64_after_hwframe+0x73/0xdd
[279123.167841] RIP: 0033:0x7effb1291117
[279123.167844] RSP: 002b:00007effae5cbcf0 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
[279123.167848] RAX: fffffffffffffe00 RBX: 0000000000000000 RCX: 00007effb1291117
[279123.167851] RDX: 0000000000000000 RSI: 0000000000000189 RDI: 00007efe61fcfae0
[279123.167854] RBP: 00007efe61fcfab8 R08: 0000000000000000 R09: 00000000ffffffff
[279123.167856] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[279123.167859] R13: 0000000000000000 R14: 0000000000000000 R15: 00007efe61fcfae0
[279123.167864]  </TASK>
[279123.167866] INFO: task ipython3:827201 blocked for more than 120 seconds.
[279123.169050]       Tainted: G        W  OE      6.2.0-35-generic #35~22.04.1-Ubuntu
[279123.170237] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[279123.171434] task:ipython3        state:D stack:0     pid:827201 ppid:742680 flags:0x00004002
[279123.171440] Call Trace:
[279123.171442]  <TASK>
[279123.171445]  __schedule+0x2b7/0x5f0
[279123.171451]  schedule+0x68/0x110
[279123.171455]  do_exit+0xf3/0x6c0
[279123.171463]  do_group_exit+0x35/0x90
[279123.171470]  get_signal+0x8a5/0x8d0
[279123.171477]  arch_do_signal_or_restart+0x2a/0x120
[279123.171484]  exit_to_user_mode_loop+0xaf/0x140
[279123.171492]  exit_to_user_mode_prepare+0xb9/0xd0
[279123.171496]  syscall_exit_to_user_mode+0x2a/0x60
[279123.171504]  do_syscall_64+0x69/0x90
[279123.171510]  ? syscall_exit_to_user_mode+0x38/0x60
[279123.171516]  ? do_syscall_64+0x69/0x90
[279123.171522]  ? syscall_exit_to_user_mode+0x38/0x60
[279123.171528]  ? do_syscall_64+0x69/0x90
[279123.171534]  ? do_syscall_64+0x69/0x90
[279123.171540]  entry_SYSCALL_64_after_hwframe+0x73/0xdd
[279123.171546] RIP: 0033:0x7effb1291117
[279123.171549] RSP: 002b:00007efe5f7fecf0 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
[279123.171553] RAX: fffffffffffffe00 RBX: 0000000000000000 RCX: 00007effb1291117
[279123.171556] RDX: 0000000000000000 RSI: 0000000000000189 RDI: 00007efe61fcfb60
[279123.171558] RBP: 00007efe61fcfb38 R08: 0000000000000000 R09: 00000000ffffffff
[279123.171561] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[279123.171564] R13: 0000000000000000 R14: 0000000000000000 R15: 00007efe61fcfb60
[279123.171569]  </TASK>
[279123.171572] INFO: task ipython3:827202 blocked for more than 120 seconds.
[279123.172782]       Tainted: G        W  OE      6.2.0-35-generic #35~22.04.1-Ubuntu
[279123.173992] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[279123.175169] task:ipython3        state:D stack:0     pid:827202 ppid:742680 flags:0x00004002
[279123.175174] Call Trace:
[279123.175176]  <TASK>
[279123.175180]  __schedule+0x2b7/0x5f0
[279123.175186]  schedule+0x68/0x110
[279123.175190]  do_exit+0xf3/0x6c0
[279123.175198]  do_group_exit+0x35/0x90
[279123.175205]  get_signal+0x8a5/0x8d0
[279123.175211]  arch_do_signal_or_restart+0x2a/0x120
[279123.175218]  exit_to_user_mode_loop+0xaf/0x140
[279123.175226]  exit_to_user_mode_prepare+0xb9/0xd0
[279123.175230]  syscall_exit_to_user_mode+0x2a/0x60
[279123.175237]  do_syscall_64+0x69/0x90
[279123.175244]  ? do_syscall_64+0x69/0x90
[279123.175249]  ? syscall_exit_to_user_mode+0x38/0x60
[279123.175255]  ? do_syscall_64+0x69/0x90
[279123.175261]  ? exit_to_user_mode_prepare+0x3b/0xd0
[279123.175265]  ? syscall_exit_to_user_mode+0x38/0x60
[279123.175271]  ? do_syscall_64+0x69/0x90
[279123.175278]  ? do_syscall_64+0x69/0x90
[279123.175284]  ? exit_to_user_mode_prepare+0x3b/0xd0
[279123.175287]  ? syscall_exit_to_user_mode+0x38/0x60
[279123.175294]  ? do_syscall_64+0x69/0x90
[279123.175299]  ? do_syscall_64+0x69/0x90
[279123.175305]  ? do_syscall_64+0x69/0x90
[279123.175311]  ? do_syscall_64+0x69/0x90
[279123.175317]  entry_SYSCALL_64_after_hwframe+0x73/0xdd
[279123.175323] RIP: 0033:0x7effb1291117
[279123.175326] RSP: 002b:00007efe5effdcf0 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
[279123.175330] RAX: fffffffffffffe00 RBX: 0000000000000000 RCX: 00007effb1291117
[279123.175332] RDX: 0000000000000000 RSI: 0000000000000189 RDI: 00007efe61fcfbe0
[279123.175335] RBP: 00007efe61fcfbb8 R08: 0000000000000000 R09: 00000000ffffffff
[279123.175338] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[279123.175340] R13: 0000000000000000 R14: 0000000000000000 R15: 00007efe61fcfbe0
[279123.175345]  </TASK>
[279123.175347] INFO: task ipython3:827203 blocked for more than 120 seconds.
[279123.176512]       Tainted: G        W  OE      6.2.0-35-generic #35~22.04.1-Ubuntu
[279123.177687] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[279123.178861] task:ipython3        state:D stack:0     pid:827203 ppid:742680 flags:0x00004002
[279123.178867] Call Trace:
[279123.178869]  <TASK>
[279123.178872]  __schedule+0x2b7/0x5f0
[279123.178878]  schedule+0x68/0x110
[279123.178882]  do_exit+0xf3/0x6c0
[279123.178889]  do_group_exit+0x35/0x90
[279123.178896]  get_signal+0x8a5/0x8d0
[279123.178903]  arch_do_signal_or_restart+0x2a/0x120
[279123.178910]  exit_to_user_mode_loop+0xaf/0x140
[279123.178918]  exit_to_user_mode_prepare+0xb9/0xd0
[279123.178922]  syscall_exit_to_user_mode+0x2a/0x60
[279123.178928]  do_syscall_64+0x69/0x90
[279123.178935]  ? exit_to_user_mode_prepare+0x3b/0xd0
[279123.178939]  ? syscall_exit_to_user_mode+0x38/0x60
[279123.178945]  ? do_syscall_64+0x69/0x90
[279123.178951]  ? exit_to_user_mode_prepare+0x3b/0xd0
[279123.178954]  ? syscall_exit_to_user_mode+0x38/0x60
[279123.178961]  ? do_syscall_64+0x69/0x90
[279123.178966]  ? do_syscall_64+0x69/0x90
[279123.178972]  entry_SYSCALL_64_after_hwframe+0x73/0xdd
[279123.178978] RIP: 0033:0x7effb1291117
[279123.178981] RSP: 002b:00007efe5c7fccf0 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
[279123.178985] RAX: fffffffffffffe00 RBX: 0000000000000000 RCX: 00007effb1291117
[279123.178987] RDX: 0000000000000000 RSI: 0000000000000189 RDI: 00007efe61fcfc60
[279123.178990] RBP: 00007efe61fcfc38 R08: 0000000000000000 R09: 00000000ffffffff
[279123.178993] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[279123.178995] R13: 0000000000000000 R14: 0000000000000000 R15: 00007efe61fcfc60
[279123.179001]  </TASK>
[279123.179003] INFO: task ipython3:827204 blocked for more than 120 seconds.
[279123.180183]       Tainted: G        W  OE      6.2.0-35-generic #35~22.04.1-Ubuntu
[279123.181403] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[279123.182612] task:ipython3        state:D stack:0     pid:827204 ppid:742680 flags:0x00004002
[279123.182617] Call Trace:
[279123.182620]  <TASK>
[279123.182623]  __schedule+0x2b7/0x5f0
[279123.182629]  schedule+0x68/0x110
[279123.182633]  do_exit+0xf3/0x6c0
[279123.182641]  do_group_exit+0x35/0x90
[279123.182648]  get_signal+0x8a5/0x8d0
[279123.182655]  arch_do_signal_or_restart+0x2a/0x120
[279123.182663]  exit_to_user_mode_loop+0xaf/0x140
[279123.182671]  exit_to_user_mode_prepare+0xb9/0xd0
[279123.182675]  syscall_exit_to_user_mode+0x2a/0x60
[279123.182682]  do_syscall_64+0x69/0x90
[279123.182688]  ? do_syscall_64+0x69/0x90
[279123.182695]  ? syscall_exit_to_user_mode+0x38/0x60
[279123.182701]  ? do_syscall_64+0x69/0x90
[279123.182707]  ? schedule+0x68/0x110
[279123.182711]  ? exit_to_user_mode_prepare+0x3b/0xd0
[279123.182715]  ? syscall_exit_to_user_mode+0x38/0x60
[279123.182721]  ? do_syscall_64+0x69/0x90
[279123.182727]  ? do_syscall_64+0x69/0x90
[279123.182733]  ? do_syscall_64+0x69/0x90
[279123.182738]  ? do_syscall_64+0x69/0x90
[279123.182744]  entry_SYSCALL_64_after_hwframe+0x73/0xdd
[279123.182750] RIP: 0033:0x7effb1291117
[279123.182753] RSP: 002b:00007efe59ffbcf0 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
[279123.182757] RAX: fffffffffffffe00 RBX: 0000000000000000 RCX: 00007effb1291117
[279123.182760] RDX: 0000000000000000 RSI: 0000000000000189 RDI: 00007efe61fcfce0
[279123.182762] RBP: 00007efe61fcfcb8 R08: 0000000000000000 R09: 00000000ffffffff
[279123.182765] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[279123.182767] R13: 0000000000000000 R14: 0000000000000000 R15: 00007efe61fcfce0
[279123.182772]  </TASK>
[279123.182775] INFO: task ipython3:827205 blocked for more than 120 seconds.
[279123.183992]       Tainted: G        W  OE      6.2.0-35-generic #35~22.04.1-Ubuntu
[279123.185234] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[279123.186455] task:ipython3        state:D stack:0     pid:827205 ppid:742680 flags:0x00004002
[279123.186460] Call Trace:
[279123.186462]  <TASK>
[279123.186465]  __schedule+0x2b7/0x5f0
[279123.186471]  schedule+0x68/0x110
[279123.186475]  do_exit+0xf3/0x6c0
[279123.186483]  do_group_exit+0x35/0x90
[279123.186490]  get_signal+0x8a5/0x8d0
[279123.186497]  arch_do_signal_or_restart+0x2a/0x120
[279123.186504]  exit_to_user_mode_loop+0xaf/0x140
[279123.186512]  exit_to_user_mode_prepare+0xb9/0xd0
[279123.186516]  syscall_exit_to_user_mode+0x2a/0x60
[279123.186523]  do_syscall_64+0x69/0x90
[279123.186530]  ? do_syscall_64+0x69/0x90
[279123.186535]  ? syscall_exit_to_user_mode+0x38/0x60
[279123.186542]  ? do_syscall_64+0x69/0x90
[279123.186547]  ? do_syscall_64+0x69/0x90
[279123.186553]  ? do_syscall_64+0x69/0x90
[279123.186559]  entry_SYSCALL_64_after_hwframe+0x73/0xdd
[279123.186565] RIP: 0033:0x7effb1291117
[279123.186568] RSP: 002b:00007efe557facf0 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
[279123.186572] RAX: fffffffffffffe00 RBX: 0000000000000000 RCX: 00007effb1291117
[279123.186574] RDX: 0000000000000000 RSI: 0000000000000189 RDI: 00007efe61fcfd60
[279123.186577] RBP: 00007efe61fcfd38 R08: 0000000000000000 R09: 00000000ffffffff
[279123.186579] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[279123.186582] R13: 0000000000000000 R14: 0000000000000000 R15: 00007efe61fcfd60
[279123.186587]  </TASK>
[279123.186589] INFO: task ipython3:827206 blocked for more than 120 seconds.
[279123.187785]       Tainted: G        W  OE      6.2.0-35-generic #35~22.04.1-Ubuntu
[279123.188983] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[279123.190171] task:ipython3        state:D stack:0     pid:827206 ppid:742680 flags:0x00004002
[279123.190177] Call Trace:
[279123.190179]  <TASK>
[279123.190181]  __schedule+0x2b7/0x5f0
[279123.190187]  schedule+0x68/0x110
[279123.190191]  do_exit+0xf3/0x6c0
[279123.190200]  do_group_exit+0x35/0x90
[279123.190207]  get_signal+0x8a5/0x8d0
[279123.190214]  arch_do_signal_or_restart+0x2a/0x120
[279123.190221]  exit_to_user_mode_loop+0xaf/0x140
[279123.190229]  exit_to_user_mode_prepare+0xb9/0xd0
[279123.190233]  syscall_exit_to_user_mode+0x2a/0x60
[279123.190240]  do_syscall_64+0x69/0x90
[279123.190246]  ? raw_spin_rq_unlock+0x10/0x40
[279123.190255]  ? __schedule+0x4f5/0x5f0
[279123.190258]  ? syscall_exit_to_user_mode+0x38/0x60
[279123.190265]  ? syscall_exit_to_user_mode+0x38/0x60
[279123.190271]  ? schedule+0x68/0x110
[279123.190275]  ? exit_to_user_mode_prepare+0x3b/0xd0
[279123.190279]  ? syscall_exit_to_user_mode+0x38/0x60
[279123.190285]  ? do_syscall_64+0x69/0x90
[279123.190291]  ? do_syscall_64+0x69/0x90
[279123.190297]  ? do_syscall_64+0x69/0x90
[279123.190303]  entry_SYSCALL_64_after_hwframe+0x73/0xdd
[279123.190309] RIP: 0033:0x7effb1291117
[279123.190313] RSP: 002b:00007efe54ff9cf0 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
[279123.190316] RAX: fffffffffffffe00 RBX: 0000000000000000 RCX: 00007effb1291117
[279123.190319] RDX: 0000000000000000 RSI: 0000000000000189 RDI: 00007efe61fcfde0
[279123.190322] RBP: 00007efe61fcfdb8 R08: 0000000000000000 R09: 00000000ffffffff
[279123.190324] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[279123.190327] R13: 0000000000000000 R14: 0000000000000000 R15: 00007efe61fcfde0
[279123.190332]  </TASK>
[279123.190334] INFO: task ipython3:827207 blocked for more than 120 seconds.
[279123.191527]       Tainted: G        W  OE      6.2.0-35-generic #35~22.04.1-Ubuntu
[279123.192730] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[279123.193957] task:ipython3        state:D stack:0     pid:827207 ppid:742680 flags:0x00004002
[279123.193962] Call Trace:
[279123.193964]  <TASK>
[279123.193967]  __schedule+0x2b7/0x5f0
[279123.193973]  schedule+0x68/0x110
[279123.193977]  do_exit+0xf3/0x6c0
[279123.193985]  do_group_exit+0x35/0x90
[279123.193992]  get_signal+0x8a5/0x8d0
[279123.193999]  arch_do_signal_or_restart+0x2a/0x120
[279123.194007]  exit_to_user_mode_loop+0xaf/0x140
[279123.194015]  exit_to_user_mode_prepare+0xb9/0xd0
[279123.194019]  syscall_exit_to_user_mode+0x2a/0x60
[279123.194026]  do_syscall_64+0x69/0x90
[279123.194033]  ? exit_to_user_mode_prepare+0x3b/0xd0
[279123.194037]  ? syscall_exit_to_user_mode+0x38/0x60
[279123.194044]  ? do_syscall_64+0x69/0x90
[279123.194050]  ? schedule+0x68/0x110
[279123.194054]  ? exit_to_user_mode_prepare+0x3b/0xd0
[279123.194058]  ? syscall_exit_to_user_mode+0x38/0x60
[279123.194065]  ? do_syscall_64+0x69/0x90
[279123.194071]  ? do_syscall_64+0x69/0x90
[279123.194077]  ? do_syscall_64+0x69/0x90
[279123.194082]  ? do_syscall_64+0x69/0x90
[279123.194088]  entry_SYSCALL_64_after_hwframe+0x73/0xdd
[279123.194094] RIP: 0033:0x7effb1291117
[279123.194097] RSP: 002b:00007efe507f8cf0 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
[279123.194100] RAX: fffffffffffffe00 RBX: 0000000000000000 RCX: 00007effb1291117
[279123.194103] RDX: 0000000000000000 RSI: 0000000000000189 RDI: 00007efe61fcfe60
[279123.194105] RBP: 00007efe61fcfe38 R08: 0000000000000000 R09: 00000000ffffffff
[279123.194108] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[279123.194110] R13: 0000000000000000 R14: 0000000000000000 R15: 00007efe61fcfe60
[279123.194115]  </TASK>
[280936.299484] usb 4-10: USB disconnect, device number 51
[280936.607342] usb 4-10: new low-speed USB device number 52 using xhci_hcd
[280936.759882] usb 4-10: New USB device found, idVendor=1c4f, idProduct=0034, bcdDevice= 1.10
[280936.759892] usb 4-10: New USB device strings: Mfr=1, Product=2, SerialNumber=0
[280936.759897] usb 4-10: Product: Usb Mouse
[280936.759901] usb 4-10: Manufacturer: SIGMACHIP
[280936.763069] input: SIGMACHIP Usb Mouse as /devices/pci0000:00/0000:00:14.0/usb4/4-10/4-10:1.0/0003:1C4F:0034.0033/input/input53
[280936.763308] hid-generic 0003:1C4F:0034.0033: input,hidraw0: USB HID v1.10 Mouse [SIGMACHIP Usb Mouse] on usb-0000:00:14.0-10/input0
[282248.278254] gmc_v9_0_process_interrupt: 87 callbacks suppressed
[282248.278260] amdgpu 0000:88:00.0: amdgpu: [mmhub0] no-retry page fault (src_id:0 ring:24 vmid:8 pasid:32768, for process ipython3 pid 846049 thread ipython3 pid 846049)
[282248.279449] amdgpu 0000:88:00.0: amdgpu:   in page starting at address 0x00007fe99e00f000 from IH client 0x12 (VMC)
[282248.280001] amdgpu 0000:88:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00800031
[282248.280561] amdgpu 0000:88:00.0: amdgpu:     Faulty UTCL2 client ID: MP0 (0x0)
[282248.281126] amdgpu 0000:88:00.0: amdgpu:     MORE_FAULTS: 0x1
[282248.281670] amdgpu 0000:88:00.0: amdgpu:     WALKER_ERROR: 0x0
[282248.282207] amdgpu 0000:88:00.0: amdgpu:     PERMISSION_FAULTS: 0x3
[282248.282743] amdgpu 0000:88:00.0: amdgpu:     MAPPING_ERROR: 0x0
[282248.283278] amdgpu 0000:88:00.0: amdgpu:     RW: 0x0
[282248.283817] amdgpu 0000:88:00.0: amdgpu: [mmhub0] no-retry page fault (src_id:0 ring:24 vmid:8 pasid:32768, for process ipython3 pid 846049 thread ipython3 pid 846049)
[282248.285178] amdgpu 0000:88:00.0: amdgpu:   in page starting at address 0x00007fe99e016000 from IH client 0x12 (VMC)
[282248.285770] amdgpu 0000:88:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00800031
[282248.286320] amdgpu 0000:88:00.0: amdgpu:     Faulty UTCL2 client ID: MP0 (0x0)
[282248.286830] amdgpu 0000:88:00.0: amdgpu:     MORE_FAULTS: 0x1
[282248.287330] amdgpu 0000:88:00.0: amdgpu:     WALKER_ERROR: 0x0
[282248.287824] amdgpu 0000:88:00.0: amdgpu:     PERMISSION_FAULTS: 0x3
[282248.288320] amdgpu 0000:88:00.0: amdgpu:     MAPPING_ERROR: 0x0
[282248.288878] amdgpu 0000:88:00.0: amdgpu:     RW: 0x0
[282248.289443] amdgpu 0000:88:00.0: amdgpu: [mmhub0] no-retry page fault (src_id:0 ring:24 vmid:8 pasid:32768, for process ipython3 pid 846049 thread ipython3 pid 846049)
[282248.290506] amdgpu 0000:88:00.0: amdgpu:   in page starting at address 0x00007fe99dff6000 from IH client 0x12 (VMC)
[282248.291057] amdgpu 0000:88:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[282248.291596] amdgpu 0000:88:00.0: amdgpu:     Faulty UTCL2 client ID: MP0 (0x0)
[282248.292123] amdgpu 0000:88:00.0: amdgpu:     MORE_FAULTS: 0x0
[282248.292665] amdgpu 0000:88:00.0: amdgpu:     WALKER_ERROR: 0x0
[282248.293231] amdgpu 0000:88:00.0: amdgpu:     PERMISSION_FAULTS: 0x0
[282248.293742] amdgpu 0000:88:00.0: amdgpu:     MAPPING_ERROR: 0x0
[282248.294250] amdgpu 0000:88:00.0: amdgpu:     RW: 0x0
[282248.294764] amdgpu 0000:88:00.0: amdgpu: [mmhub0] no-retry page fault (src_id:0 ring:24 vmid:8 pasid:32768, for process ipython3 pid 846049 thread ipython3 pid 846049)
[282248.295825] amdgpu 0000:88:00.0: amdgpu:   in page starting at address 0x00007fe99dff7000 from IH client 0x12 (VMC)
[282248.296375] amdgpu 0000:88:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[282248.296966] amdgpu 0000:88:00.0: amdgpu:     Faulty UTCL2 client ID: MP0 (0x0)
[282248.297543] amdgpu 0000:88:00.0: amdgpu:     MORE_FAULTS: 0x0
[282248.298077] amdgpu 0000:88:00.0: amdgpu:     WALKER_ERROR: 0x0
[282248.298589] amdgpu 0000:88:00.0: amdgpu:     PERMISSION_FAULTS: 0x0
[282248.299095] amdgpu 0000:88:00.0: amdgpu:     MAPPING_ERROR: 0x0
[282248.299597] amdgpu 0000:88:00.0: amdgpu:     RW: 0x0
[282248.300101] amdgpu 0000:88:00.0: amdgpu: [mmhub0] no-retry page fault (src_id:0 ring:24 vmid:8 pasid:32768, for process ipython3 pid 846049 thread ipython3 pid 846049)
[282248.301192] amdgpu 0000:88:00.0: amdgpu:   in page starting at address 0x00007fe99dff8000 from IH client 0x12 (VMC)
[282248.301782] amdgpu 0000:88:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[282248.302318] amdgpu 0000:88:00.0: amdgpu:     Faulty UTCL2 client ID: MP0 (0x0)
[282248.302851] amdgpu 0000:88:00.0: amdgpu:     MORE_FAULTS: 0x0
[282248.303377] amdgpu 0000:88:00.0: amdgpu:     WALKER_ERROR: 0x0
[282248.303899] amdgpu 0000:88:00.0: amdgpu:     PERMISSION_FAULTS: 0x0
[282248.304417] amdgpu 0000:88:00.0: amdgpu:     MAPPING_ERROR: 0x0
[282248.304999] amdgpu 0000:88:00.0: amdgpu:     RW: 0x0
[282248.305547] amdgpu 0000:88:00.0: amdgpu: [mmhub0] no-retry page fault (src_id:0 ring:24 vmid:8 pasid:32768, for process ipython3 pid 846049 thread ipython3 pid 846049)
[282248.306588] amdgpu 0000:88:00.0: amdgpu:   in page starting at address 0x00007fe99dff9000 from IH client 0x12 (VMC)
[282248.307122] amdgpu 0000:88:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[282248.307649] amdgpu 0000:88:00.0: amdgpu:     Faulty UTCL2 client ID: MP0 (0x0)
[282248.308173] amdgpu 0000:88:00.0: amdgpu:     MORE_FAULTS: 0x0
[282248.308724] amdgpu 0000:88:00.0: amdgpu:     WALKER_ERROR: 0x0
[282248.309286] amdgpu 0000:88:00.0: amdgpu:     PERMISSION_FAULTS: 0x0
[282248.309794] amdgpu 0000:88:00.0: amdgpu:     MAPPING_ERROR: 0x0
[282248.310299] amdgpu 0000:88:00.0: amdgpu:     RW: 0x0
[282248.310786] amdgpu 0000:88:00.0: amdgpu: [mmhub0] no-retry page fault (src_id:0 ring:24 vmid:8 pasid:32768, for process ipython3 pid 846049 thread ipython3 pid 846049)
[282248.311778] amdgpu 0000:88:00.0: amdgpu:   in page starting at address 0x00007fe99dffa000 from IH client 0x12 (VMC)
[282248.312286] amdgpu 0000:88:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[282248.312839] amdgpu 0000:88:00.0: amdgpu:     Faulty UTCL2 client ID: MP0 (0x0)
[282248.313387] amdgpu 0000:88:00.0: amdgpu:     MORE_FAULTS: 0x0
[282248.313873] amdgpu 0000:88:00.0: amdgpu:     WALKER_ERROR: 0x0
[282248.314352] amdgpu 0000:88:00.0: amdgpu:     PERMISSION_FAULTS: 0x0
[282248.314830] amdgpu 0000:88:00.0: amdgpu:     MAPPING_ERROR: 0x0
[282248.315308] amdgpu 0000:88:00.0: amdgpu:     RW: 0x0
[282248.315792] amdgpu 0000:88:00.0: amdgpu: [mmhub0] no-retry page fault (src_id:0 ring:24 vmid:8 pasid:32768, for process ipython3 pid 846049 thread ipython3 pid 846049)
[282248.316833] amdgpu 0000:88:00.0: amdgpu:   in page starting at address 0x00007fe99dffb000 from IH client 0x12 (VMC)
[282248.317390] amdgpu 0000:88:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[282248.317900] amdgpu 0000:88:00.0: amdgpu:     Faulty UTCL2 client ID: MP0 (0x0)
[282248.318396] amdgpu 0000:88:00.0: amdgpu:     MORE_FAULTS: 0x0
[282248.318881] amdgpu 0000:88:00.0: amdgpu:     WALKER_ERROR: 0x0
[282248.319360] amdgpu 0000:88:00.0: amdgpu:     PERMISSION_FAULTS: 0x0
[282248.319838] amdgpu 0000:88:00.0: amdgpu:     MAPPING_ERROR: 0x0
[282248.320315] amdgpu 0000:88:00.0: amdgpu:     RW: 0x0
[282248.320843] amdgpu 0000:88:00.0: amdgpu: [mmhub0] no-retry page fault (src_id:0 ring:24 vmid:8 pasid:32768, for process ipython3 pid 846049 thread ipython3 pid 846049)
[282248.321886] amdgpu 0000:88:00.0: amdgpu:   in page starting at address 0x00007fe99dffc000 from IH client 0x12 (VMC)
[282248.322391] amdgpu 0000:88:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[282248.322884] amdgpu 0000:88:00.0: amdgpu:     Faulty UTCL2 client ID: MP0 (0x0)
[282248.323365] amdgpu 0000:88:00.0: amdgpu:     MORE_FAULTS: 0x0
[282248.324192] amdgpu 0000:88:00.0: amdgpu:     WALKER_ERROR: 0x0
[282248.324731] amdgpu 0000:88:00.0: amdgpu:     PERMISSION_FAULTS: 0x0
[282248.325260] amdgpu 0000:88:00.0: amdgpu:     MAPPING_ERROR: 0x0
[282248.325737] amdgpu 0000:88:00.0: amdgpu:     RW: 0x0
[282248.326222] amdgpu 0000:88:00.0: amdgpu: [mmhub0] no-retry page fault (src_id:0 ring:24 vmid:8 pasid:32768, for process ipython3 pid 846049 thread ipython3 pid 846049)
[282248.327221] amdgpu 0000:88:00.0: amdgpu:   in page starting at address 0x00007fe99dffd000 from IH client 0x12 (VMC)
[282248.327743] amdgpu 0000:88:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[282248.328252] amdgpu 0000:88:00.0: amdgpu:     Faulty UTCL2 client ID: MP0 (0x0)
[282248.328775] amdgpu 0000:88:00.0: amdgpu:     MORE_FAULTS: 0x0
[282248.329311] amdgpu 0000:88:00.0: amdgpu:     WALKER_ERROR: 0x0
[282248.329790] amdgpu 0000:88:00.0: amdgpu:     PERMISSION_FAULTS: 0x0
[282248.330269] amdgpu 0000:88:00.0: amdgpu:     MAPPING_ERROR: 0x0
[282248.330749] amdgpu 0000:88:00.0: amdgpu:     RW: 0x0
[283788.472877] gmc_v9_0_process_interrupt: 113 callbacks suppressed
[283788.472887] amdgpu 0000:8b:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:8 pasid:32769, for process python3 pid 856758 thread python3 pid 856758)
[283788.475339] amdgpu 0000:8b:00.0: amdgpu:   in page starting at address 0x0000000000000000 from IH client 0x1b (UTCL2)
[283788.476645] amdgpu 0000:8b:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00801030
[283788.477313] amdgpu 0000:88:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:8 pasid:32772, for process python3 pid 856757 thread python3 pid 856757)
[283788.477898] amdgpu 0000:8b:00.0: amdgpu:     Faulty UTCL2 client ID: TCP (0x8)
[283788.479373] amdgpu 0000:88:00.0: amdgpu:   in page starting at address 0x0000000000000000 from IH client 0x1b (UTCL2)
[283788.480618] amdgpu 0000:8b:00.0: amdgpu:     MORE_FAULTS: 0x0
[283788.480621] amdgpu 0000:8b:00.0: amdgpu:     WALKER_ERROR: 0x0
[283788.480623] amdgpu 0000:8b:00.0: amdgpu:     PERMISSION_FAULTS: 0x3
[283788.481649] amdgpu 0000:88:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00801030
[283788.482894] amdgpu 0000:8b:00.0: amdgpu:     MAPPING_ERROR: 0x0
[283788.483652] amdgpu 0000:88:00.0: amdgpu:     Faulty UTCL2 client ID: TCP (0x8)
[283788.484902] amdgpu 0000:8b:00.0: amdgpu:     RW: 0x0
[283788.485658] amdgpu 0000:88:00.0: amdgpu:     MORE_FAULTS: 0x0
[283788.489109] amdgpu 0000:88:00.0: amdgpu:     WALKER_ERROR: 0x0
[283788.489994] amdgpu 0000:88:00.0: amdgpu:     PERMISSION_FAULTS: 0x3
[283788.491112] amdgpu 0000:88:00.0: amdgpu:     MAPPING_ERROR: 0x0
[283788.492110] amdgpu 0000:88:00.0: amdgpu:     RW: 0x0


@kentrussell

@kotee4ko
Author

I am following this tutorial: https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html

dmesg:


[  110.313943] amdgpu: Runlist is getting oversubscribed. Expect reduced ROCm performance.
[  110.318253] amdgpu: Runlist is getting oversubscribed. Expect reduced ROCm performance.
[  110.707027] amdgpu 0000:8b:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:14 pasid:32773, for process python3 pid 1802 thread python3 pid 1802)
[  110.707038] amdgpu 0000:8b:00.0: amdgpu:   in page starting at address 0x0000000000000000 from IH client 0x1b (UTCL2)
[  110.707047] amdgpu 0000:8b:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00E01030
[  110.707050] amdgpu 0000:8b:00.0: amdgpu:      Faulty UTCL2 client ID: TCP (0x8)
[  110.707052] amdgpu 0000:8b:00.0: amdgpu:      MORE_FAULTS: 0x0
[  110.707054] amdgpu 0000:8b:00.0: amdgpu:      WALKER_ERROR: 0x0
[  110.707056] amdgpu 0000:8b:00.0: amdgpu:      PERMISSION_FAULTS: 0x3
[  110.707058] amdgpu 0000:8b:00.0: amdgpu:      MAPPING_ERROR: 0x0
[  110.707060] amdgpu 0000:8b:00.0: amdgpu:      RW: 0x0
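
The decoded fields printed under each VM_L2_PROTECTION_FAULT_STATUS line can be reproduced from the raw register value. Below is a minimal sketch, assuming the bit layout listed in the code. That layout is an assumption checked only against the values printed in the logs in this thread (including the vmid reported in the fault header), not taken from register documentation, and `decode_fault_status` is a hypothetical helper name.

```python
def decode_fault_status(status):
    """Split a raw VM_L2_PROTECTION_FAULT_STATUS value into named fields."""
    # (shift, width) pairs. ASSUMED layout, verified only against this thread's logs.
    fields = {
        "MORE_FAULTS": (0, 1),
        "WALKER_ERROR": (1, 3),
        "PERMISSION_FAULTS": (4, 4),
        "MAPPING_ERROR": (8, 1),
        "CID": (9, 9),    # "Faulty UTCL2 client ID" in the log
        "RW": (18, 1),
        "VMID": (20, 4),  # matches "vmid:8" / "vmid:14" in the fault header
    }
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in fields.items()
    }


# The three raw values that appear in the dmesg output above.
for raw in (0x00800031, 0x00801030, 0x00E01030):
    decoded = decode_fault_status(raw)
    print(f"0x{raw:08X}:", ", ".join(f"{k}=0x{v:x}" for k, v in decoded.items()))
```

Under this assumed layout, both gfxhub fault values decode to client ID 0x8 (TCP) with PERMISSION_FAULTS 0x3, which agrees with the lines the driver printed.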

Source code of test3.py:

import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

# Parameters and DataLoaders
input_size = 5
output_size = 2

batch_size = 30
data_size = 100

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")


class RandomDataset(Dataset):

    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len

rand_loader = DataLoader(dataset=RandomDataset(input_size, data_size),
                         batch_size=batch_size, shuffle=True)

class Model(nn.Module):
    # Our model

    def __init__(self, input_size, output_size):
        super(Model, self).__init__()
        self.fc = nn.Linear(input_size, output_size)

    def forward(self, input):
        output = self.fc(input)
        print("\tIn Model: input size", input.size(),
              "output size", output.size())

        return output


model = Model(input_size, output_size)
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    # dim = 0 [30, xxx] -> [10, ...], [10, ...], [10, ...] on 3 GPUs
    model = nn.DataParallel(model)

model.to(device)


for data in rand_loader:
    input = data.to(device)
    output = model(input)
    print("Outside: input size", input.size(),
          "output_size", output.size())



|root@ai-dev|:{/opt/AI/llama2.cu/transformer} #_ torchrun --standalone  --nproc_per_node=2 ./test3.py 
[2023-10-29 16:28:23,941] torch.distributed.run: [WARNING] master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
[2023-10-29 16:28:23,942] torch.distributed.run: [WARNING] 
[2023-10-29 16:28:23,942] torch.distributed.run: [WARNING] *****************************************
[2023-10-29 16:28:23,942] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2023-10-29 16:28:23,942] torch.distributed.run: [WARNING] *****************************************
/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:611: UserWarning: Can't initialize NVML
  warnings.warn("Can't initialize NVML")
Let's use 2 GPUs!
/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:611: UserWarning: Can't initialize NVML
  warnings.warn("Can't initialize NVML")
Let's use 2 GPUs!
Memory access fault by GPU node-3 (Agent handle: 0x556bfa092cd0) on address (nil). Reason: Page not present or supervisor privilege.
   In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
   In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
   In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
   In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
   In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
   In Model: input size torch.Size([15, 5]) output size torch.Size([15, 2])
Outside: input size torch.Size([30, 5]) output_size torch.Size([30, 2])
   In Model: input size torch.Size([5, 5]) output size torch.Size([5, 2])
   In Model: input size torch.Size([5, 5]) output size torch.Size([5, 2])
Outside: input size torch.Size([10, 5]) output_size torch.Size([10, 2])
[2023-10-29 16:28:29,095] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 1 (pid: 1802) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
=====================================================
./test3.py FAILED
-----------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-10-29_16:28:29
  host      : ai-dev
  rank      : 1 (local_rank: 1)
  exitcode  : -6 (pid: 1802)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 1802
=====================================================


GPU load during execution:


|root@ai-dev|:{~} #_ rocm-smi 


========================= ROCm System Management Interface =========================
=================================== Concise Info ===================================
GPU  Temp (DieEdge)  AvgPwr  SCLK     MCLK    Fan    Perf  PwrCap  VRAM%  GPU%  
0    46.0c           51.0W   1440Mhz  945Mhz  9.41%  auto  110.0W    3%   96%   
1    35.0c           55.0W   1440Mhz  945Mhz  9.41%  auto  110.0W    3%   97%   
====================================================================================
=============================== End of ROCm SMI Log ================================


@kentrussell

kotee4ko reopened this Oct 29, 2023
@kentrussell
Contributor

I think it's worth checking with the HIP guys. The fact that you have an address of 0x0 means that there's a bad address being passed in from somewhere. We can work our way down the stack to find it, but I think the HIP guys will have a good initial assessment on why they're using a NULL address:
[ 110.707038] amdgpu 0000:8b:00.0: amdgpu: in page starting at address 0x0000000000000000 from IH client 0x1b (UTCL2)
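
To check how many of the reported faults hit address 0x0, the dmesg output can be filtered with a short script. This is a minimal sketch, assuming the line format seen in the logs in this thread. `find_page_faults` and the regular expression are illustrative names and are not part of any ROCm tool.

```python
import re

# Matches the "in page starting at address ..." lines as they appear in this thread.
FAULT_RE = re.compile(
    r"amdgpu (?P<dev>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]): amdgpu:\s+"
    r"in page starting at address 0x(?P<addr>[0-9a-f]+) "
    r"from IH client 0x[0-9a-f]+ \((?P<client>\w+)\)"
)


def find_page_faults(dmesg_text):
    """Return (pci_device, address, ih_client) for each fault-address line."""
    faults = []
    for line in dmesg_text.splitlines():
        match = FAULT_RE.search(line)
        if match:
            faults.append((match["dev"], int(match["addr"], 16), match["client"]))
    return faults


# Two lines copied from the logs in this issue.
sample = """\
[  110.707038] amdgpu 0000:8b:00.0: amdgpu:   in page starting at address 0x0000000000000000 from IH client 0x1b (UTCL2)
[282248.279449] amdgpu 0000:88:00.0: amdgpu:   in page starting at address 0x00007fe99e00f000 from IH client 0x12 (VMC)
"""

for dev, addr, client in find_page_faults(sample):
    kind = "NULL address" if addr == 0 else "non-NULL address"
    print(f"{dev} {client}: 0x{addr:016x} ({kind})")
```

Passing the full captured dmesg text to `find_page_faults` separates the faults at address 0x0 from the faults at non-zero addresses.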

@kotee4ko
Author

Sir, I'm really sorry to disturb you, but I don't know whom to ping on the HIP team about this question.
May I ask you to help me ping the right person, please?
Here is the issue in the HIP repository: ROCm/HIP#3352

Also, might it be possible to implement something like a quick-and-dirty workaround for this stack of ring-3 bugs and perform manual read/write operations on VRAM?

Is there a way to do so?

Much appreciated, Sir. @kentrussell

@kentrussell
Contributor

Hopefully they'll see it and reply. I know that the HIP bug reports tend to take a bit longer to get to, and it's only been 1 workday since you opened the issue.
As for the workaround, nothing can be done on the kernel side until we establish where that NULL pointer comes from. rocgdb may help you find where the address becomes 0; without that, you would basically be writing your own memory manager from scratch, which would unfortunately take a lot more time.
