HIP: can't use GPU with official Tensorflow or PyTorch ROCM containers with Ryzen 5600G #207109

Closed
lucasew opened this issue Dec 21, 2022 · 41 comments

Comments

@lucasew
Contributor

lucasew commented Dec 21, 2022

Describe the bug

I have a Ryzen 5600G APU and I am trying to use TensorFlow or PyTorch to do some machine learning stuff. Either one would do: for now I am just trying to get the GPU recognized and usable. So far I have only been able to use it in Blender, either with blender-hip or with a workaround for blender-bin.

Steps To Reproduce

Steps to reproduce the behavior:

For PyTorch

  1. docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined rocm/pytorch:latest
  2. python
  3. import torch
  4. Error: "hipErrorNoBinaryForGpu: Unable to find code object for all current devices!" Aborted (core dumped)

For TensorFlow

  1. docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined rocm/tensorflow:latest
  2. python
  3. import tensorflow as tf
  4. tf.config.list_physical_devices()
  5. Error: "hipErrorNoBinaryForGpu: Unable to find code object for all current devices!" Aborted (core dumped)
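
The steps above kill the interpreter, so here is a minimal sketch of running the same imports in a child process and reporting how it ended. The helper name probe and its extra_env parameter are hypothetical, and it assumes it is run inside the container, where torch or tensorflow is installed. "Aborted (core dumped)" suggests the process dies from SIGABRT, which should show up as a negative return code on Linux.

```python
import os
import signal
import subprocess
import sys


def probe(statement, extra_env=None, timeout=120):
    """Run `statement` in a fresh Python process and describe how it ended.

    `extra_env` is an optional dict of environment variables to add for the
    child only (hypothetical parameter, handy for trying override values).
    """
    env = dict(os.environ)
    env.update(extra_env or {})
    proc = subprocess.run(
        [sys.executable, "-c", statement],
        env=env,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if proc.returncode < 0:
        # On POSIX a negative return code means the child was killed by a signal.
        return "killed by " + signal.Signals(-proc.returncode).name
    return "exit code {}".format(proc.returncode)
```

For example, probe("import torch") or probe("import tensorflow as tf; tf.config.list_physical_devices()") can be called from a script without losing the parent shell when the import aborts.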

If I run export HSA_OVERRIDE_GFX_VERSION=10.3.0 and then do anything that actually uses the GPU, like torch.tensor([[1,2],[3,4]]).to(torch.device('cuda')), it crashes and dmesg shows the following:

[  810.761484] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:8 pasid:32772, for process python pid 2536 thread python pid 2536)
[  810.761488] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:221 vmid:8 pasid:32772, for process python pid 2536 thread python pid 2536)
[  810.761492] amdgpu 0000:07:00.0: amdgpu:   in page starting at address 0x0000000000000000 from IH client 0x1b (UTCL2)
[  810.761499] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x008012B1
[  810.761500] amdgpu 0000:07:00.0: amdgpu: 	 Faulty UTCL2 client ID: SQC (inst) (0x9)
[  810.761501] amdgpu 0000:07:00.0: amdgpu: 	 MORE_FAULTS: 0x1
[  810.761502] amdgpu 0000:07:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[  810.761503] amdgpu 0000:07:00.0: amdgpu: 	 PERMISSION_FAULTS: 0xb
[  810.761503] amdgpu 0000:07:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[  810.761504] amdgpu 0000:07:00.0: amdgpu: 	 RW: 0x0
[  810.761507] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:221 vmid:8 pasid:32772, for process python pid 2536 thread python pid 2536)
[  810.761509] amdgpu 0000:07:00.0: amdgpu:   in page starting at address 0x0000000000000000 from IH client 0x1b (UTCL2)
[  810.761516] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  810.761516] amdgpu 0000:07:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB (0x0)
[  810.761517] amdgpu 0000:07:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[  810.761518] amdgpu 0000:07:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[  810.761518] amdgpu 0000:07:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[  810.761519] amdgpu 0000:07:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[  810.761520] amdgpu 0000:07:00.0: amdgpu: 	 RW: 0x0
[  810.761521] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:221 vmid:8 pasid:32772, for process python pid 2536 thread python pid 2536)
[  810.761522] amdgpu 0000:07:00.0: amdgpu:   in page starting at address 0x0000000000000000 from IH client 0x1b (UTCL2)
[  810.761528] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  810.761529] amdgpu 0000:07:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB (0x0)
[  810.761530] amdgpu 0000:07:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[  810.761530] amdgpu 0000:07:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[  810.761531] amdgpu 0000:07:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[  810.761532] amdgpu 0000:07:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[  810.761532] amdgpu 0000:07:00.0: amdgpu: 	 RW: 0x0
[  810.761536] amdgpu 0000:07:00.0: amdgpu:   in page starting at address 0x0000000000000000 from IH client 0x1b (UTCL2)
[  810.761542] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  810.761543] amdgpu 0000:07:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB (0x0)
[  810.761543] amdgpu 0000:07:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[  810.761544] amdgpu 0000:07:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[  810.761545] amdgpu 0000:07:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[  810.761545] amdgpu 0000:07:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[  810.761546] amdgpu 0000:07:00.0: amdgpu: 	 RW: 0x0
[  810.761547] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:8 pasid:32772, for process python pid 2536 thread python pid 2536)
[  810.761549] amdgpu 0000:07:00.0: amdgpu:   in page starting at address 0x0000000000000000 from IH client 0x1b (UTCL2)
[  810.761555] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  810.761555] amdgpu 0000:07:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB (0x0)
[  810.761556] amdgpu 0000:07:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[  810.761557] amdgpu 0000:07:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[  810.761557] amdgpu 0000:07:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[  810.761558] amdgpu 0000:07:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[  810.761559] amdgpu 0000:07:00.0: amdgpu: 	 RW: 0x0
[  810.761560] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:8 pasid:32772, for process python pid 2536 thread python pid 2536)
[  810.761561] amdgpu 0000:07:00.0: amdgpu:   in page starting at address 0x0000000000000000 from IH client 0x1b (UTCL2)
[  810.761567] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  810.761568] amdgpu 0000:07:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB (0x0)
[  810.761568] amdgpu 0000:07:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[  810.761569] amdgpu 0000:07:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[  810.761570] amdgpu 0000:07:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[  810.761570] amdgpu 0000:07:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[  810.761571] amdgpu 0000:07:00.0: amdgpu: 	 RW: 0x0
[  810.761572] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:8 pasid:32772, for process python pid 2536 thread python pid 2536)
[  810.761573] amdgpu 0000:07:00.0: amdgpu:   in page starting at address 0x0000000000000000 from IH client 0x1b (UTCL2)
[  810.761579] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  810.761580] amdgpu 0000:07:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB (0x0)
[  810.761581] amdgpu 0000:07:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[  810.761581] amdgpu 0000:07:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[  810.761582] amdgpu 0000:07:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[  810.761582] amdgpu 0000:07:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[  810.761583] amdgpu 0000:07:00.0: amdgpu: 	 RW: 0x0
[  810.761584] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:8 pasid:32772, for process python pid 2536 thread python pid 2536)
[  810.761585] amdgpu 0000:07:00.0: amdgpu:   in page starting at address 0x0000000000000000 from IH client 0x1b (UTCL2)
[  810.761591] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  810.761592] amdgpu 0000:07:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB (0x0)
[  810.761593] amdgpu 0000:07:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[  810.761593] amdgpu 0000:07:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[  810.761594] amdgpu 0000:07:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[  810.761595] amdgpu 0000:07:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[  810.761595] amdgpu 0000:07:00.0: amdgpu: 	 RW: 0x0
[  814.761529] amdgpu: HIQ MQD's queue_doorbell_id0 is not 0, Queue preemption time out
[  814.761535] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 1, cu_id 6, err_type 2
[  814.761537] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 2, cu_id 6, err_type 2
[  814.761538] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 3, cu_id 6, err_type 2
[  814.761539] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 0, cu_id 6, err_type 2
[  814.761540] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 1, cu_id 5, err_type 2
[  814.761541] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 2, cu_id 5, err_type 2
[  814.761542] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 3, cu_id 5, err_type 2
[  814.761543] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 0, cu_id 5, err_type 2
[  814.761544] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 1, cu_id 4, err_type 2
[  814.761545] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 2, cu_id 4, err_type 2
[  814.761545] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 3, cu_id 4, err_type 2
[  814.761546] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 0, cu_id 4, err_type 2
[  814.761547] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 1, cu_id 3, err_type 2
[  814.761548] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 2, cu_id 3, err_type 2
[  814.761549] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 3, cu_id 3, err_type 2
[  814.761550] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 0, cu_id 3, err_type 2
[  814.761550] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 1, cu_id 2, err_type 2
[  814.761551] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 2, cu_id 2, err_type 2
[  814.761552] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 0, cu_id 2, err_type 2
[  814.761553] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 3, cu_id 2, err_type 2
[  814.761554] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 1, cu_id 1, err_type 2
[  814.761554] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 2, cu_id 1, err_type 2
[  814.761555] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 3, cu_id 1, err_type 2
[  814.761556] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 0, cu_id 1, err_type 2
[  814.761557] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 1, cu_id 0, err_type 2
[  814.761558] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 2, cu_id 0, err_type 2
[  814.761558] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 3, cu_id 0, err_type 2
[  814.761559] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 0, cu_id 0, err_type 2
[  817.502308] ------------[ cut here ]------------
[  817.502313] WARNING: CPU: 11 PID: 2550 at kernel/workqueue.c:3083 __flush_work.isra.0+0x21f/0x230
[  817.502320] Modules linked in: af_packet nf_conntrack_netlink xfrm_user xfrm_algo xt_addrtype br_netfilter overlay xt_CHECKSUM xt_MASQUERADE ipt_REJECT nf_reject_ipv4 nft_chain_nat nf_nat rfkill ip6_tables xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip6t_rpfilter ipt_rpfilter snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio led_class xt_pkttype snd_hda_codec_hdmi xt_LOG nf_log_syslog xt_tcpudp nls_iso8859_1 nft_compat nls_cp437 vfat snd_hda_intel fat snd_intel_dspcfg snd_intel_sdw_acpi nft_counter snd_hda_codec intel_rapl_msr wmi_bmof snd_hda_core evdev snd_hwdep r8169 mac_hid snd_pcm realtek snd_timer mdio_devres nf_tables edac_mce_amd snd libphy edac_core soundcore intel_rapl_common libcrc32c crc32_pclmul ghash_clmulni_intel video nfnetlink sp5100_tco aesni_intel watchdog i2c_piix4 k10temp sch_fq_codel libaes deflate crypto_simd cryptd gpio_amdpt efi_pstore gpio_generic wmi pinctrl_amd tiny_power_button acpi_cpufreq rapl button ctr atkbd libps2 serio loop veth
[  817.502356]  bridge stp llc tun vboxnetflt(O) vboxnetadp(O) vboxdrv(O) kvm_amd kvm irqbypass fuse pstore configfs efivarfs ip_tables x_tables autofs4 ext4 crc32c_generic crc16 mbcache jbd2 sd_mod xhci_pci xhci_pci_renesas xhci_hcd ahci libahci libata usbcore nvme scsi_mod nvme_core crc32c_intel t10_pi crc_t10dif crct10dif_generic crct10dif_pclmul crct10dif_common usb_common scsi_common rtc_cmos dm_mod amdgpu drm_ttm_helper ttm agpgart iommu_v2 gpu_sched i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm i2c_core backlight
[  817.502381] CPU: 11 PID: 2550 Comm: python Tainted: G        W  O      5.15.82 #1-NixOS
[  817.502383] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./B450M Steel Legend, BIOS P4.00 05/06/2021
[  817.502384] RIP: 0010:__flush_work.isra.0+0x21f/0x230
[  817.502386] Code: 8b 4d 00 4c 8b 45 08 89 ca 48 c1 e9 04 83 e2 08 83 e1 0f 83 ca 02 89 c8 48 0f ba 6d 00 03 e9 13 ff ff ff 0f 0b e9 45 ff ff ff <0f> 0b 45 31 ed e9 3b ff ff ff e8 f2 00 81 00 66 90 0f 1f 44 00 00
[  817.502388] RSP: 0018:ffffab1142487b28 EFLAGS: 00010246
[  817.502389] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[  817.502390] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff8f0281325318
[  817.502391] RBP: ffff8f0281325318 R08: 0000000000000000 R09: ffffffffbe250a50
[  817.502391] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8f0281325318
[  817.502392] R13: 0000000000000001 R14: 0000000000000003 R15: ffff8f02b8399d8c
[  817.502392] FS:  0000000000000000(0000) GS:ffff8f058e4c0000(0000) knlGS:0000000000000000
[  817.502393] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  817.502394] CR2: 00007ffea315c414 CR3: 0000000148124000 CR4: 0000000000750ee0
[  817.502395] PKRU: 55555554
[  817.502396] Call Trace:
[  817.502398]  <TASK>
[  817.502401]  ? del_timer+0x55/0x80
[  817.502404]  __cancel_work_timer+0x11a/0x1b0
[  817.502406]  kfd_process_notifier_release+0x8b/0x160 [amdgpu]
[  817.502571]  __mmu_notifier_release+0x73/0x210
[  817.502577]  exit_mmap+0x1ad/0x1f0
[  817.502580]  ? delayacct_add_tsk+0x63/0x1b0
[  817.502582]  ? exit_robust_list+0x5c/0x140
[  817.502584]  ? __cond_resched+0x16/0x50
[  817.502586]  ? mutex_lock+0xe/0x30
[  817.502587]  mmput+0x5a/0x140
[  817.502590]  do_exit+0x2f0/0xa40
[  817.502592]  do_group_exit+0x33/0xa0
[  817.502594]  get_signal+0x14a/0x910
[  817.502595]  arch_do_signal_or_restart+0x101/0x730
[  817.502598]  ? do_send_sig_info+0x6b/0xc0
[  817.502600]  ? do_tkill+0x88/0xb0
[  817.502601]  exit_to_user_mode_prepare+0x10e/0x230
[  817.502603]  syscall_exit_to_user_mode+0x18/0x40
[  817.502605]  do_syscall_64+0x48/0x90
[  817.502607]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[  817.502608] RIP: 0033:0x7f5550f8400b
[  817.502631] Code: Unable to access opcode bytes at RIP 0x7f5550f83fe1.
[  817.502632] RSP: 002b:00007f5335a6eb20 EFLAGS: 00000246 ORIG_RAX: 000000000000000e
[  817.502633] RAX: 0000000000000000 RBX: 00007f5335a6f700 RCX: 00007f5550f8400b
[  817.502634] RDX: 0000000000000000 RSI: 00007f5335a6eb20 RDI: 0000000000000002
[  817.502635] RBP: 00007f5335a6ee30 R08: 0000000000000000 R09: 00007f5335a6eb20
[  817.502635] R10: 0000000000000008 R11: 0000000000000246 R12: 000056099afb14e0
[  817.502636] R13: 0000000000000000 R14: 00007f5335a6edd0 R15: 0000000000000003
[  817.502637]  </TASK>
[  817.502638] ---[ end trace 1cc27b60f1089df3 ]---
[  821.502652] amdgpu: HIQ MQD's queue_doorbell_id0 is not 0, Queue preemption time out
[  821.502654] amdgpu: Resetting wave fronts (cpsch) on dev 000000008c1046c5
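
Since the log above is long and repetitive, here is a rough sketch of how the fault records could be tallied. The regular expressions are written only against the line shapes visible in this log, so other kernel versions may format them differently. The names summarize and SAMPLE are hypothetical, and the sample lines are copied from the output above.

```python
import re
from collections import Counter

# Patterns based only on the line shapes seen in the dmesg output above.
FAULT_RE = re.compile(r"\[gfxhub\d+\] (retry|no-retry) page fault")
CLIENT_RE = re.compile(r"Faulty UTCL2 client ID: (.+?) \((0x[0-9a-fA-F]+)\)\s*$")
ADDR_RE = re.compile(r"in page starting at address (0x[0-9a-fA-F]+)")

SAMPLE = """\
[  810.761484] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:8 pasid:32772, for process python pid 2536 thread python pid 2536)
[  810.761488] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:221 vmid:8 pasid:32772, for process python pid 2536 thread python pid 2536)
[  810.761492] amdgpu 0000:07:00.0: amdgpu:   in page starting at address 0x0000000000000000 from IH client 0x1b (UTCL2)
[  810.761500] amdgpu 0000:07:00.0: amdgpu: \t Faulty UTCL2 client ID: SQC (inst) (0x9)
[  810.761516] amdgpu 0000:07:00.0: amdgpu: \t Faulty UTCL2 client ID: CB (0x0)
"""


def summarize(lines):
    """Count fault kinds, faulting client names and fault addresses."""
    faults, clients, addresses = Counter(), Counter(), Counter()
    for line in lines:
        match = FAULT_RE.search(line)
        if match:
            faults[match.group(1)] += 1
            continue
        match = CLIENT_RE.search(line)
        if match:
            clients[match.group(1)] += 1
            continue
        match = ADDR_RE.search(line)
        if match:
            addresses[int(match.group(1), 16)] += 1
    return faults, clients, addresses
```

Feeding it the full dmesg output instead of SAMPLE (for example via sys.stdin) would show at a glance that every fault is reported at address 0x0.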

Expected behavior

Machine learning should work the same way it does in Google Colab, I guess.

Additional context

Nixcfg revision used to replicate the issue: https://github.com/lucasew/nixcfg/tree/ff430dc0992d9247989f739a326536f87e345d98/nodes/whiterun

A PC with an i5 6400 + RX460 has the same problem, but I don't have access to it anymore to test eventual fixes.

Notify maintainers

@NixOS/rocm-maintainers

Metadata

Please run nix-shell -p nix-info --run "nix-info -m" and paste the result.

lucasew@whiterun ~ 134$ nix-shell -p nix-info --run "nix-info -m"
 - system: `"x86_64-linux"`
 - host os: `Linux 5.15.82, NixOS, 22.11 (Raccoon), 22.11.20221216.9d692a7`
 - multi-user?: `yes`
 - sandbox: `yes`
 - version: `nix-env (Nix) 2.11.0`
 - nixpkgs: `/etc/flake/nixpkgs`
@Madouura
Contributor

Madouura commented Dec 21, 2022

Can you use the hip from nixos-unstable and tell me if it still gives you that error?
Looking through my PRs concerning the ROCm packages I can't find anything that could cause this aside from possibly #206421, and that's not in master yet.
Also try using hip from staging (#206421) if you can, see if that works.
From what I can see, you're using a docker container and that should have its own hip, which may be the problem instead of nixpkgs' hip.

@Madouura
Contributor

Madouura commented Dec 21, 2022

I should also mention that I am working on native ROCm support for pytorch and tensorflow in nixpkgs so you don't need to use those docker containers, but that's going to take some time.

@Madouura
Contributor

Also try export HSA_OVERRIDE_GFX_VERSION=9.0.0 instead.

@Flakebi
Member

Flakebi commented Dec 21, 2022

As far as I see, a Ryzen 5600G has a Vega GPU (gfx9), so I’m not surprised that everything crashes when you force gfx10.3 behavior – two generations later – with HSA_OVERRIDE_GFX_VERSION=10.3.0 :)
It seems to be a gfx90c card, so HSA_OVERRIDE_GFX_VERSION=9.0.12 should be more correct.
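
To make the mapping concrete, here is a small sketch of how a gfx target name could be turned into an override string. It assumes the usual naming convention as I understand it: the last two characters are the minor and stepping numbers as hexadecimal digits, and whatever precedes them is the decimal major version. The helper name gfx_to_override is hypothetical and not part of ROCm.

```python
import re

# Assumed convention: gfx<major decimal><minor hex digit><stepping hex digit>
_GFX_RE = re.compile(r"^gfx(\d{1,2})([0-9a-f])([0-9a-f])$")


def gfx_to_override(target):
    """Convert a target name such as 'gfx90c' to 'major.minor.stepping'."""
    match = _GFX_RE.match(target.lower())
    if match is None:
        raise ValueError("unrecognized gfx target: {!r}".format(target))
    major = int(match.group(1))
    minor = int(match.group(2), 16)     # one hex digit
    stepping = int(match.group(3), 16)  # one hex digit, so 'c' becomes 12
    return "{}.{}.{}".format(major, minor, stepping)


print(gfx_to_override("gfx90c"))   # 9.0.12
print(gfx_to_override("gfx1030"))  # 10.3.0
```

So 10.3.0 corresponds to gfx1030 and 9.0.0 to gfx900, while gfx90c gives 9.0.12.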

@lucasew
Contributor Author

lucasew commented Dec 21, 2022

As far as I see, a Ryzen 5600G has a Vega GPU (gfx9), so I’m not surprised that everything crashes when you force gfx10.3 behavior – two generations later – with HSA_OVERRIDE_GFX_VERSION=10.3.0 :) It seems to be a gfx90c card, so HSA_OVERRIDE_GFX_VERSION=9.0.12 should be more correct.

As for the generation thing, I have no idea what I am doing xD. I just saw people mentioning this on the Internet and decided to try it.

Can you use the hip from nixos-unstable and tell me if it still gives you that error? Looking through my PRs concerning the ROCm packages I can't find anything that could cause this aside from possibly #206421, and that's not in master yet. Also try using hip from staging (#206421) if you can, see if that works. From what I can see, you're using a docker container and that should have its own hip, which may be the problem instead of nixpkgs' hip.

Switched to the latest unstable just now.

  • Both HSA_OVERRIDE_GFX_VERSION=9.0.12 and HSA_OVERRIDE_GFX_VERSION=9.0.0
>>> import torch
"hipErrorNoBinaryForGpu: Unable to find code object for all current devices!"
Aborted (core dumped)

  • HSA_OVERRIDE_GFX_VERSION=10.3.0
[  306.174866] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:221 vmid:8 pasid:32771, for process python pid 2315 thread python pid 2315)
[  306.174872] amdgpu 0000:07:00.0: amdgpu:   in page starting at address 0x0000000000000000 from IH client 0x1b (UTCL2)
[  306.174879] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x008012B1
[  306.174881] amdgpu 0000:07:00.0: amdgpu: 	 Faulty UTCL2 client ID: SQC (inst) (0x9)
[  306.174882] amdgpu 0000:07:00.0: amdgpu: 	 MORE_FAULTS: 0x1
[  306.174883] amdgpu 0000:07:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[  306.174884] amdgpu 0000:07:00.0: amdgpu: 	 PERMISSION_FAULTS: 0xb
[  306.174885] amdgpu 0000:07:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[  306.174886] amdgpu 0000:07:00.0: amdgpu: 	 RW: 0x0
[  306.174889] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:221 vmid:8 pasid:32771, for process python pid 2315 thread python pid 2315)
[  306.174891] amdgpu 0000:07:00.0: amdgpu:   in page starting at address 0x0000000000000000 from IH client 0x1b (UTCL2)
[  306.174898] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  306.174899] amdgpu 0000:07:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB (0x0)
[  306.174900] amdgpu 0000:07:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[  306.174901] amdgpu 0000:07:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[  306.174902] amdgpu 0000:07:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[  306.174903] amdgpu 0000:07:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[  306.174904] amdgpu 0000:07:00.0: amdgpu: 	 RW: 0x0
[  306.174906] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:221 vmid:8 pasid:32771, for process python pid 2315 thread python pid 2315)
[  306.174907] amdgpu 0000:07:00.0: amdgpu:   in page starting at address 0x0000000000000000 from IH client 0x1b (UTCL2)
[  306.174914] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  306.174915] amdgpu 0000:07:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB (0x0)
[  306.174916] amdgpu 0000:07:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[  306.174917] amdgpu 0000:07:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[  306.174918] amdgpu 0000:07:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[  306.174918] amdgpu 0000:07:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[  306.174919] amdgpu 0000:07:00.0: amdgpu: 	 RW: 0x0
[  306.174922] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:8 pasid:32771, for process python pid 2315 thread python pid 2315)
[  306.174924] amdgpu 0000:07:00.0: amdgpu:   in page starting at address 0x0000000000000000 from IH client 0x1b (UTCL2)
[  306.174931] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  306.174931] amdgpu 0000:07:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB (0x0)
[  306.174932] amdgpu 0000:07:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[  306.174933] amdgpu 0000:07:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[  306.174934] amdgpu 0000:07:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[  306.174935] amdgpu 0000:07:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[  306.174936] amdgpu 0000:07:00.0: amdgpu: 	 RW: 0x0
[  306.174937] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:8 pasid:32771, for process python pid 2315 thread python pid 2315)
[  306.174939] amdgpu 0000:07:00.0: amdgpu:   in page starting at address 0x0000000000000000 from IH client 0x1b (UTCL2)
[  306.174945] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  306.174946] amdgpu 0000:07:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB (0x0)
[  306.174947] amdgpu 0000:07:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[  306.174948] amdgpu 0000:07:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[  306.174949] amdgpu 0000:07:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[  306.174950] amdgpu 0000:07:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[  306.174951] amdgpu 0000:07:00.0: amdgpu: 	 RW: 0x0
[  306.174952] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:8 pasid:32771, for process python pid 2315 thread python pid 2315)
[  306.174954] amdgpu 0000:07:00.0: amdgpu:   in page starting at address 0x0000000000000000 from IH client 0x1b (UTCL2)
[  306.174960] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  306.174961] amdgpu 0000:07:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB (0x0)
[  306.174962] amdgpu 0000:07:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[  306.174963] amdgpu 0000:07:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[  306.174964] amdgpu 0000:07:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[  306.174965] amdgpu 0000:07:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[  306.174965] amdgpu 0000:07:00.0: amdgpu: 	 RW: 0x0
[  306.174967] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:8 pasid:32771, for process python pid 2315 thread python pid 2315)
[  306.174968] amdgpu 0000:07:00.0: amdgpu:   in page starting at address 0x0000000000000000 from IH client 0x1b (UTCL2)
[  306.174975] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  306.174976] amdgpu 0000:07:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB (0x0)
[  306.174977] amdgpu 0000:07:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[  306.174977] amdgpu 0000:07:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[  306.174978] amdgpu 0000:07:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[  306.174979] amdgpu 0000:07:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[  306.174980] amdgpu 0000:07:00.0: amdgpu: 	 RW: 0x0
[  306.174981] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:8 pasid:32771, for process python pid 2315 thread python pid 2315)
[  306.174983] amdgpu 0000:07:00.0: amdgpu:   in page starting at address 0x0000000000000000 from IH client 0x1b (UTCL2)
[  306.174989] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  306.174990] amdgpu 0000:07:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB (0x0)
[  306.174991] amdgpu 0000:07:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[  306.174992] amdgpu 0000:07:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[  306.174993] amdgpu 0000:07:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[  306.174994] amdgpu 0000:07:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[  306.174995] amdgpu 0000:07:00.0: amdgpu: 	 RW: 0x0
[  310.174910] amdgpu: HIQ MQD's queue_doorbell_id0 is not 0, Queue preemption time out
[  310.174915] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 1, cu_id 4, err_type 2
[  310.174918] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 2, cu_id 4, err_type 2
[  310.174919] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 3, cu_id 4, err_type 2
[  310.174920] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 0, cu_id 4, err_type 2
[  310.174921] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 1, cu_id 3, err_type 2
[  310.174922] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 2, cu_id 3, err_type 2
[  310.174923] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 1, cu_id 6, err_type 2
[  310.174923] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 2, cu_id 6, err_type 2
[  310.174924] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 3, cu_id 6, err_type 2
[  310.174925] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 0, cu_id 6, err_type 2
[  310.174926] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 1, cu_id 5, err_type 2
[  310.174927] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 2, cu_id 5, err_type 2
[  310.174927] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 3, cu_id 5, err_type 2
[  310.174928] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 0, cu_id 5, err_type 2
[  310.174929] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 3, cu_id 3, err_type 2
[  310.174930] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 0, cu_id 3, err_type 2
[  310.174931] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 1, cu_id 2, err_type 2
[  310.174931] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 2, cu_id 2, err_type 2
[  310.174932] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 0, cu_id 2, err_type 2
[  310.174933] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 3, cu_id 2, err_type 2
[  310.174934] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 1, cu_id 1, err_type 2
[  310.174935] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 2, cu_id 1, err_type 2
[  310.174936] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 3, cu_id 1, err_type 2
[  310.174936] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 0, cu_id 1, err_type 2
[  310.174937] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 1, cu_id 0, err_type 2
[  310.174938] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 2, cu_id 0, err_type 2
[  310.174939] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 3, cu_id 0, err_type 2
[  310.174940] amdgpu: sq_intr: error, se 0, data 0x0, sh 0, priv 1, wave_id 0, simd_id 0, cu_id 0, err_type 2
[  312.816528] ------------[ cut here ]------------
[  312.816531] WARNING: CPU: 2 PID: 2329 at kernel/workqueue.c:3083 __flush_work.isra.0+0x21f/0x230
[  312.816537] Modules linked in: af_packet nf_conntrack_netlink xfrm_user xfrm_algo xt_addrtype br_netfilter overlay xt_CHECKSUM xt_MASQUERADE ipt_REJECT nf_reject_ipv4 nft_chain_nat nf_nat rfkill nls_iso8859_1 nls_cp437 vfat fat ip6_tables xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 snd_hda_codec_realtek ip6t_rpfilter ipt_rpfilter snd_hda_codec_generic ledtrig_audio led_class snd_hda_codec_hdmi xt_pkttype xt_LOG nf_log_syslog xt_tcpudp nft_compat snd_hda_intel snd_intel_dspcfg nft_counter intel_rapl_msr snd_intel_sdw_acpi snd_hda_codec edac_mce_amd evdev wmi_bmof mac_hid edac_core intel_rapl_common snd_hda_core crc32_pclmul ghash_clmulni_intel aesni_intel snd_hwdep snd_pcm libaes crypto_simd r8169 cryptd nf_tables rapl realtek snd_timer libcrc32c mdio_devres sp5100_tco watchdog snd sch_fq_codel nfnetlink libphy soundcore k10temp i2c_piix4 video gpio_amdpt gpio_generic pinctrl_amd tiny_power_button wmi acpi_cpufreq button ctr atkbd libps2 serio loop veth bridge stp llc tun
[  312.816570]  vboxnetflt(O) vboxnetadp(O) vboxdrv(O) kvm_amd ccp rng_core kvm irqbypass fuse deflate efi_pstore pstore configfs efivarfs dmi_sysfs ip_tables x_tables autofs4 ext4 crc32c_generic crc16 mbcache jbd2 sd_mod xhci_pci xhci_pci_renesas xhci_hcd ahci libahci libata nvme usbcore crc32c_intel scsi_mod nvme_core t10_pi crc_t10dif crct10dif_generic crct10dif_pclmul crct10dif_common usb_common scsi_common rtc_cmos dm_mod amdgpu drm_ttm_helper ttm agpgart iommu_v2 gpu_sched i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm i2c_core backlight
[  312.816595] CPU: 2 PID: 2329 Comm: python Tainted: G        W  O      5.15.83 #1-NixOS
[  312.816597] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./B450M Steel Legend, BIOS P4.00 05/06/2021
[  312.816598] RIP: 0010:__flush_work.isra.0+0x21f/0x230
[  312.816600] Code: 8b 4d 00 4c 8b 45 08 89 ca 48 c1 e9 04 83 e2 08 83 e1 0f 83 ca 02 89 c8 48 0f ba 6d 00 03 e9 13 ff ff ff 0f 0b e9 45 ff ff ff <0f> 0b 45 31 ed e9 3b ff ff ff e8 e2 31 81 00 66 90 0f 1f 44 00 00
[  312.816601] RSP: 0018:ffffb14001cb7b28 EFLAGS: 00010246
[  312.816602] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[  312.816603] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff92872a69ab18
[  312.816604] RBP: ffff92872a69ab18 R08: 0000000000000000 R09: ffffffff96450b50
[  312.816604] R10: 0000000000000000 R11: 0000000000000000 R12: ffff92872a69ab18
[  312.816605] R13: 0000000000000001 R14: 0000000000000003 R15: ffff928705e5272c
[  312.816606] FS:  0000000000000000(0000) GS:ffff928a0e280000(0000) knlGS:0000000000000000
[  312.816606] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  312.816607] CR2: 0000000000d133a0 CR3: 000000012388a000 CR4: 0000000000750ee0
[  312.816608] PKRU: 55555554
[  312.816609] Call Trace:
[  312.816611]  <TASK>
[  312.816614]  ? del_timer+0x55/0x80
[  312.816617]  __cancel_work_timer+0x11a/0x1b0
[  312.816619]  kfd_process_notifier_release+0x8b/0x160 [amdgpu]
[  312.816786]  __mmu_notifier_release+0x73/0x210
[  312.816790]  exit_mmap+0x1ad/0x1f0
[  312.816793]  ? delayacct_add_tsk+0x63/0x1b0
[  312.816795]  ? exit_robust_list+0x5c/0x140
[  312.816796]  ? __cond_resched+0x16/0x50
[  312.816799]  ? mutex_lock+0xe/0x30
[  312.816800]  mmput+0x5a/0x140
[  312.816802]  do_exit+0x2f0/0xa40
[  312.816805]  do_group_exit+0x33/0xa0
[  312.816806]  get_signal+0x14a/0x910
[  312.816808]  arch_do_signal_or_restart+0x101/0x730
[  312.816810]  ? do_send_sig_info+0x6b/0xc0
[  312.816812]  ? do_tkill+0x88/0xb0
[  312.816813]  exit_to_user_mode_prepare+0x10e/0x230
[  312.816815]  syscall_exit_to_user_mode+0x18/0x40
[  312.816826]  do_syscall_64+0x48/0x90
[  312.816829]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[  312.816831] RIP: 0033:0x7fd93b15d00b
[  312.816848] Code: Unable to access opcode bytes at RIP 0x7fd93b15cfe1.
[  312.816849] RSP: 002b:00007fd71fc46b20 EFLAGS: 00000246 ORIG_RAX: 000000000000000e
[  312.816850] RAX: 0000000000000000 RBX: 00007fd71fc47700 RCX: 00007fd93b15d00b
[  312.816851] RDX: 0000000000000000 RSI: 00007fd71fc46b20 RDI: 0000000000000002
[  312.816851] RBP: 00007fd71fc46e30 R08: 0000000000000000 R09: 00007fd71fc46b20
[  312.816852] R10: 0000000000000008 R11: 0000000000000246 R12: 000055c4de1234d0
[  312.816852] R13: 0000000000000000 R14: 00007fd71fc46dd0 R15: 0000000000000003
[  312.816853]  </TASK>
[  312.816854] ---[ end trace 25d048475f484f4d ]---
[  316.816865] amdgpu: HIQ MQD's queue_doorbell_id0 is not 0, Queue preemption time out
[  316.816869] amdgpu: Resetting wave fronts (cpsch) on dev 00000000a08df1ec
[  368.819686] amdgpu: HIQ MQD's queue_doorbell_id0 is not 0, Queue preemption time out
[  368.819692] amdgpu: Failed to evict process queues
[  368.819693] amdgpu: Failed to evict queues of pasid 0x8003
[  368.819712] ------------[ cut here ]------------
[  368.819714] WARNING: CPU: 11 PID: 2437 at kernel/workqueue.c:3083 __flush_work.isra.0+0x21f/0x230
[  368.819736] Modules linked in: af_packet nf_conntrack_netlink xfrm_user xfrm_algo xt_addrtype br_netfilter overlay xt_CHECKSUM xt_MASQUERADE ipt_REJECT nf_reject_ipv4 nft_chain_nat nf_nat rfkill nls_iso8859_1 nls_cp437 vfat fat ip6_tables xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 snd_hda_codec_realtek ip6t_rpfilter ipt_rpfilter snd_hda_codec_generic ledtrig_audio led_class snd_hda_codec_hdmi xt_pkttype xt_LOG nf_log_syslog xt_tcpudp nft_compat snd_hda_intel snd_intel_dspcfg nft_counter intel_rapl_msr snd_intel_sdw_acpi snd_hda_codec edac_mce_amd evdev wmi_bmof mac_hid edac_core intel_rapl_common snd_hda_core crc32_pclmul ghash_clmulni_intel aesni_intel snd_hwdep snd_pcm libaes crypto_simd r8169 cryptd nf_tables rapl realtek snd_timer libcrc32c mdio_devres sp5100_tco watchdog snd sch_fq_codel nfnetlink libphy soundcore k10temp i2c_piix4 video gpio_amdpt gpio_generic pinctrl_amd tiny_power_button wmi acpi_cpufreq button ctr atkbd libps2 serio loop veth bridge stp llc tun
[  368.819797]  vboxnetflt(O) vboxnetadp(O) vboxdrv(O) kvm_amd ccp rng_core kvm irqbypass fuse deflate efi_pstore pstore configfs efivarfs dmi_sysfs ip_tables x_tables autofs4 ext4 crc32c_generic crc16 mbcache jbd2 sd_mod xhci_pci xhci_pci_renesas xhci_hcd ahci libahci libata nvme usbcore crc32c_intel scsi_mod nvme_core t10_pi crc_t10dif crct10dif_generic crct10dif_pclmul crct10dif_common usb_common scsi_common rtc_cmos dm_mod amdgpu drm_ttm_helper ttm agpgart iommu_v2 gpu_sched i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm i2c_core backlight
[  368.819833] CPU: 11 PID: 2437 Comm: python Tainted: G        W  O      5.15.83 #1-NixOS
[  368.819836] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./B450M Steel Legend, BIOS P4.00 05/06/2021
[  368.819837] RIP: 0010:__flush_work.isra.0+0x21f/0x230
[  368.819840] Code: 8b 4d 00 4c 8b 45 08 89 ca 48 c1 e9 04 83 e2 08 83 e1 0f 83 ca 02 89 c8 48 0f ba 6d 00 03 e9 13 ff ff ff 0f 0b e9 45 ff ff ff <0f> 0b 45 31 ed e9 3b ff ff ff e8 e2 31 81 00 66 90 0f 1f 44 00 00
[  368.819843] RSP: 0018:ffffb14001d07b28 EFLAGS: 00010246
[  368.819844] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[  368.819846] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff9287271f1318
[  368.819847] RBP: ffff9287271f1318 R08: 0000000000000000 R09: ffffffff96450b50
[  368.819848] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9287271f1318
[  368.819849] R13: 0000000000000001 R14: 0000000000000003 R15: ffff928705e528ac
[  368.819850] FS:  0000000000000000(0000) GS:ffff928a0e4c0000(0000) knlGS:0000000000000000
[  368.819851] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  368.819852] CR2: 00007f4631ed0000 CR3: 0000000106b2e000 CR4: 0000000000750ee0
[  368.819853] PKRU: 55555554
[  368.819854] Call Trace:
[  368.819857]  <TASK>
[  368.819858]  ? __cond_resched+0x31/0x50
[  368.819865]  ? __wait_for_common+0x3b/0x160
[  368.819866]  ? srcu_gp_start_if_needed+0x23b/0x3e0
[  368.819870]  __cancel_work_timer+0x11a/0x1b0
[  368.819873]  kfd_process_notifier_release+0x8b/0x160 [amdgpu]
[  368.820071]  __mmu_notifier_release+0x73/0x210
[  368.820076]  exit_mmap+0x1ad/0x1f0
[  368.820079]  ? delayacct_add_tsk+0x63/0x1b0
[  368.820081]  ? exit_robust_list+0x5c/0x140
[  368.820083]  ? __cond_resched+0x16/0x50
[  368.820084]  ? mutex_lock+0xe/0x30
[  368.820085]  mmput+0x5a/0x140
[  368.820088]  do_exit+0x2f0/0xa40
[  368.820089]  do_group_exit+0x33/0xa0
[  368.820090]  get_signal+0x14a/0x910
[  368.820093]  arch_do_signal_or_restart+0x101/0x730
[  368.820095]  ? do_send_sig_info+0x6b/0xc0
[  368.820096]  ? do_tkill+0x88/0xb0
[  368.820098]  exit_to_user_mode_prepare+0x10e/0x230
[  368.820099]  syscall_exit_to_user_mode+0x18/0x40
[  368.820102]  do_syscall_64+0x48/0x90
[  368.820103]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[  368.820104] RIP: 0033:0x7f464524500b
[  368.820117] Code: Unable to access opcode bytes at RIP 0x7f4645244fe1.
[  368.820117] RSP: 002b:00007ffe5e59eb00 EFLAGS: 00000246 ORIG_RAX: 000000000000000e
[  368.820118] RAX: 0000000000000000 RBX: 00007f4645201340 RCX: 00007f464524500b
[  368.820119] RDX: 0000000000000000 RSI: 00007ffe5e59eb00 RDI: 0000000000000002
[  368.820119] RBP: 00007ffe5e59ef70 R08: 0000000000000000 R09: 00007ffe5e59eb00
[  368.820119] R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000001
[  368.820120] R13: 00007ffe5e59ef00 R14: 00007f4631edd000 R15: 00007ffe5e59ef20
[  368.820121]  </TASK>
[  368.820121] ---[ end trace 25d048475f484f4e ]---
[  368.820176] amdgpu: HIQ MQD's queue_doorbell_id0 is not 0, Queue preemption time out
[  368.820176] amdgpu: Resetting wave fronts (cpsch) on dev 00000000a08df1ec
[  390.784230] amdgpu: HIQ MQD's queue_doorbell_id0 is not 0, Queue preemption time out
[  390.784238] amdgpu: Failed to evict process queues
[  390.784239] amdgpu: Failed to evict queues of pasid 0x8003
[  390.784252] ------------[ cut here ]------------
[  390.784254] WARNING: CPU: 2 PID: 2466 at kernel/workqueue.c:3083 __flush_work.isra.0+0x21f/0x230
[  390.784260] Modules linked in: af_packet nf_conntrack_netlink xfrm_user xfrm_algo xt_addrtype br_netfilter overlay xt_CHECKSUM xt_MASQUERADE ipt_REJECT nf_reject_ipv4 nft_chain_nat nf_nat rfkill nls_iso8859_1 nls_cp437 vfat fat ip6_tables xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 snd_hda_codec_realtek ip6t_rpfilter ipt_rpfilter snd_hda_codec_generic ledtrig_audio led_class snd_hda_codec_hdmi xt_pkttype xt_LOG nf_log_syslog xt_tcpudp nft_compat snd_hda_intel snd_intel_dspcfg nft_counter intel_rapl_msr snd_intel_sdw_acpi snd_hda_codec edac_mce_amd evdev wmi_bmof mac_hid edac_core intel_rapl_common snd_hda_core crc32_pclmul ghash_clmulni_intel aesni_intel snd_hwdep snd_pcm libaes crypto_simd r8169 cryptd nf_tables rapl realtek snd_timer libcrc32c mdio_devres sp5100_tco watchdog snd sch_fq_codel nfnetlink libphy soundcore k10temp i2c_piix4 video gpio_amdpt gpio_generic pinctrl_amd tiny_power_button wmi acpi_cpufreq button ctr atkbd libps2 serio loop veth bridge stp llc tun
[  390.784296]  vboxnetflt(O) vboxnetadp(O) vboxdrv(O) kvm_amd ccp rng_core kvm irqbypass fuse deflate efi_pstore pstore configfs efivarfs dmi_sysfs ip_tables x_tables autofs4 ext4 crc32c_generic crc16 mbcache jbd2 sd_mod xhci_pci xhci_pci_renesas xhci_hcd ahci libahci libata nvme usbcore crc32c_intel scsi_mod nvme_core t10_pi crc_t10dif crct10dif_generic crct10dif_pclmul crct10dif_common usb_common scsi_common rtc_cmos dm_mod amdgpu drm_ttm_helper ttm agpgart iommu_v2 gpu_sched i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm i2c_core backlight
[  390.784321] CPU: 2 PID: 2466 Comm: python Tainted: G        W  O      5.15.83 #1-NixOS
[  390.784322] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./B450M Steel Legend, BIOS P4.00 05/06/2021
[  390.784323] RIP: 0010:__flush_work.isra.0+0x21f/0x230
[  390.784326] Code: 8b 4d 00 4c 8b 45 08 89 ca 48 c1 e9 04 83 e2 08 83 e1 0f 83 ca 02 89 c8 48 0f ba 6d 00 03 e9 13 ff ff ff 0f 0b e9 45 ff ff ff <0f> 0b 45 31 ed e9 3b ff ff ff e8 e2 31 81 00 66 90 0f 1f 44 00 00
[  390.784327] RSP: 0018:ffffb14001e17b28 EFLAGS: 00010246
[  390.784328] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[  390.784329] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff928718bf9318
[  390.784330] RBP: ffff928718bf9318 R08: 0000000000000000 R09: ffffffff96450b50
[  390.784330] R10: 0000000000000000 R11: 0000000000000000 R12: ffff928718bf9318
[  390.784331] R13: 0000000000000001 R14: 0000000000000003 R15: ffff928705e5218c
[  390.784331] FS:  0000000000000000(0000) GS:ffff928a0e280000(0000) knlGS:0000000000000000
[  390.784332] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  390.784333] CR2: 000055620ed12fe4 CR3: 00000001472ac000 CR4: 0000000000750ee0
[  390.784334] PKRU: 55555554
[  390.784335] Call Trace:
[  390.784336]  <TASK>
[  390.784337]  ? __cond_resched+0x31/0x50
[  390.784341]  ? __wait_for_common+0x3b/0x160
[  390.784343]  ? srcu_gp_start_if_needed+0x23b/0x3e0
[  390.784345]  __cancel_work_timer+0x11a/0x1b0
[  390.784347]  kfd_process_notifier_release+0x8b/0x160 [amdgpu]
[  390.784493]  __mmu_notifier_release+0x73/0x210
[  390.784498]  exit_mmap+0x1ad/0x1f0
[  390.784501]  ? delayacct_add_tsk+0x63/0x1b0
[  390.784503]  ? exit_robust_list+0x5c/0x140
[  390.784505]  ? __cond_resched+0x16/0x50
[  390.784506]  ? mutex_lock+0xe/0x30
[  390.784507]  mmput+0x5a/0x140
[  390.784510]  do_exit+0x2f0/0xa40
[  390.784511]  do_group_exit+0x33/0xa0
[  390.784513]  get_signal+0x14a/0x910
[  390.784514]  arch_do_signal_or_restart+0x101/0x730
[  390.784517]  ? do_send_sig_info+0x6b/0xc0
[  390.784518]  ? do_tkill+0x88/0xb0
[  390.784519]  exit_to_user_mode_prepare+0x10e/0x230
[  390.784521]  syscall_exit_to_user_mode+0x18/0x40
[  390.784523]  do_syscall_64+0x48/0x90
[  390.784525]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[  390.784527] RIP: 0033:0x7fd99a92700b
[  390.784540] Code: Unable to access opcode bytes at RIP 0x7fd99a926fe1.
[  390.784540] RSP: 002b:00007fff8bcc0520 EFLAGS: 00000246 ORIG_RAX: 000000000000000e
[  390.784542] RAX: 0000000000000000 RBX: 00007fd99a8e3340 RCX: 00007fd99a92700b
[  390.784542] RDX: 0000000000000000 RSI: 00007fff8bcc0520 RDI: 0000000000000002
[  390.784543] RBP: 00007fff8bcc0990 R08: 0000000000000000 R09: 00007fff8bcc0520
[  390.784543] R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000001
[  390.784544] R13: 00007fff8bcc0920 R14: 00007fd935f15000 R15: 00007fff8bcc0940
[  390.784545]  </TASK>
[  390.784545] ---[ end trace 25d048475f484f4f ]---
[  390.784559] amdgpu: HIQ MQD's queue_doorbell_id0 is not 0, Queue preemption time out
[  390.784560] amdgpu: Resetting wave fronts (cpsch) on dev 00000000a08df1ec

Edit 1: I am now switching to staging. The build hasn't started screaming (yet).

@Madouura
Contributor

Madouura commented Dec 22, 2022

About this generation thing I have no idea what I am doing xD just saw people mentioning this on the Internet and decided to try.

I'm in the same boat, it's how #197885 started lol.
Anyway, I think I gave you bad advice. You should still try staging and the other things, but please try Flakebi's suggestion first, as it's likely the actual problem.
Never mind, there it is; my bad reading comprehension again.

@Madouura
Contributor

rocm/pytorch:latest

Try without the latest tag. Again, this should just be an issue with the Docker container.

@lucasew
Contributor Author

lucasew commented Dec 22, 2022

Same problem on staging

@Madouura
Contributor

Madouura commented Dec 23, 2022

I haven't gotten TensorFlow working yet, but you should be able to use PyTorch once the next staging-next and #206995 are merged.
If you wanna test now, see: Madouura@df71e71
You may need to add roctracer and rccl to LD_LIBRARY_PATH.
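
As a rough illustration of the LD_LIBRARY_PATH suggestion above, here is a minimal sketch that builds an environment with extra library directories prepended. The helper name `with_library_dirs` and the example paths are hypothetical and not taken from this thread; on NixOS the real directories would be the `lib` folders of the roctracer and rccl store paths. The variable is generally read by the dynamic loader when a process starts, so the sketch prepares an environment for a child process instead of changing the running interpreter.

```python
import os


def with_library_dirs(env, lib_dirs):
    """Return a copy of env with lib_dirs prepended to LD_LIBRARY_PATH.

    Existing entries are kept after the new ones, and duplicates are dropped
    while preserving order.
    """
    new_env = dict(env)
    existing = [
        p for p in new_env.get("LD_LIBRARY_PATH", "").split(os.pathsep) if p
    ]
    merged = []
    for path in list(lib_dirs) + existing:
        if path not in merged:
            merged.append(path)
    new_env["LD_LIBRARY_PATH"] = os.pathsep.join(merged)
    return new_env


# Hypothetical usage (paths are placeholders, not real store paths):
#   import subprocess, sys
#   env = with_library_dirs(os.environ, ["/path/to/roctracer/lib",
#                                        "/path/to/rccl/lib"])
#   subprocess.run([sys.executable, "benchmark.py"], env=env, check=False)
```

The function leaves the original mapping untouched, so it can be called on `os.environ` without side effects.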

@lucasew
Contributor Author

lucasew commented Dec 26, 2022

I think I found a bug in nix shell

lucasew@whiterun ~ 0$ nix shell github:Madouura/nixpkgs/df71e711026a37178f9a258f236db0e1a66e2f0b#legacyPackages.x86_64-linux.{python3Packages.torchWithRocm,roctracer,rccl,python3} -c python 
Python 3.10.9 (main, Dec  6 2022, 18:44:57) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'torch'

@Madouura
Contributor

I haven't run into that problem; I may have linked you a bad build.
Try Madouura@f6d4e98.

@Madouura
Contributor

Oh, this is interesting. I didn't realize nix shell was supposed to propagate. That explains a lot and may be linked to some of the issues I've had in #206995.

@lucasew
Contributor Author

lucasew commented Dec 26, 2022

Tested with the following shell.nix (a workaround for that issue):

{ pkgs ? import (builtins.fetchTarball "https://github.com/Madouura/nixpkgs/archive/f6d4e98b49a52fe564b832e20527b527fa2c90a6.tar.gz") {} }:
pkgs.mkShell {
  buildInputs = with pkgs; [
    python3Packages.torchWithRocm
  ];
}

Same problem as with the container so far. However, I had gone back to stable, so I will try again with the latest staging commit.

@Madouura
Contributor

Madouura commented Dec 27, 2022

Try this.
nix-shell -I nixpkgs=${nixpkgs-at-f6d4e98b49a52fe564b832e20527b527fa2c90a6} -p python3Packages.torchWithRocm
python ./benchmark.py

import torch, timeit

print(f"CUDA support: {torch.cuda.is_available()} (Should be \"True\")")
print(f"CUDA version: {torch.version.cuda} (Should be \"None\")")
print(f"HIP version: {torch.version.hip} (Should contain \"5.4\")")

# Storing ID of current CUDA device
cuda_id = torch.cuda.current_device()
print(f"Current CUDA device ID: {torch.cuda.current_device()}")
print(f"Current CUDA device name: {torch.cuda.get_device_name(cuda_id)} (Should be AMD, not NVIDIA)")

def batched_dot_mul_sum(a, b):
    '''Computes batched dot by multiplying and summing'''
    return a.mul(b).sum(-1)


def batched_dot_bmm(a, b):
    '''Computes batched dot by reducing to bmm'''
    a = a.reshape(-1, 1, a.shape[-1])
    b = b.reshape(-1, b.shape[-1], 1)
    return torch.bmm(a, b).flatten(-3)

x = torch.randn(10000, 1024, device='cuda')

t0 = timeit.Timer(
    stmt='batched_dot_mul_sum(x, x)',
    setup='from __main__ import batched_dot_mul_sum',
    globals={'x': x})

t1 = timeit.Timer(
    stmt='batched_dot_bmm(x, x)',
    setup='from __main__ import batched_dot_bmm',
    globals={'x': x})

# Ran each twice to show difference before/after warmup
print(f'mul_sum(x, x):  {t0.timeit(100) / 100 * 1e6:>5.1f} us')
print(f'mul_sum(x, x):  {t0.timeit(100) / 100 * 1e6:>5.1f} us')
print(f'bmm(x, x):      {t1.timeit(100) / 100 * 1e6:>5.1f} us')
print(f'bmm(x, x):      {t1.timeit(100) / 100 * 1e6:>5.1f} us')

If everything is working, the output should match what's in the parentheses, and if you have something like corectrl, you'll see a GPU frequency spike while it is running.
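
To make the "should match what's in the parentheses" check less manual, here is a minimal sketch that reads the report lines printed by a script like the one above and compares each value with its quoted expectation. The function name `check_report` is hypothetical, and the sketch only understands the `Should be "..."` and `Should contain "..."` forms; lines with an unquoted hint (such as the device-name line) are skipped. It only checks the printed metadata, so it says nothing about whether a kernel launch will succeed.

```python
import re

# Matches lines such as:  HIP version: 5.4.22802-0 (Should contain "5.4")
_EXPECTATION = re.compile(
    r'^(?P<label>[^:]+):\s*(?P<value>.*?)\s*'
    r'\(Should (?P<kind>be|contain) "(?P<expected>[^"]*)"\)\s*$'
)


def check_report(lines):
    """Map each report label to True/False depending on its expectation.

    Lines without a quoted expectation are ignored.
    """
    results = {}
    for line in lines:
        match = _EXPECTATION.match(line.strip())
        if match is None:
            continue
        value, expected = match["value"], match["expected"]
        if match["kind"] == "be":
            results[match["label"]] = value == expected
        else:  # "contain"
            results[match["label"]] = expected in value
    return results


if __name__ == "__main__":
    sample = [
        'CUDA support: True (Should be "True")',
        'CUDA version: None (Should be "None")',
        'HIP version: 5.4.22802-0 (Should contain "5.4")',
        "Current CUDA device ID: 0",
    ]
    for label, ok in check_report(sample).items():
        print(f"{label}: {'ok' if ok else 'MISMATCH'}")
```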

@Madouura
Contributor

If that still doesn't work, it's honestly possible that the Ryzen 5600G just isn't supported.
It theoretically should be, though, since it's Vega IIRC.

@Madouura
Contributor

@Flakebi If you have an AMD GPU, could you run this check/benchmark as well to confirm it isn't just working for me and only me?

@lucasew
Contributor Author

lucasew commented Dec 28, 2022

Same problem.

I built my NixOS config against staging right after #206421 was merged, because the latest staging failed partway through the build due to an unrelated package.

This is the shell.nix I am using to provision torch based on the commit you mentioned:

let
  nixpkgs = builtins.fetchTarball "https://github.com/NixOS/nixpkgs/archive/f6d4e98b49a52fe564b832e20527b527fa2c90a6.tar.gz";
  pkgs = import nixpkgs { };
in pkgs.mkShell {
  buildInputs = with pkgs; [ python3Packages.torchWithRocm ];
}

This is my Python prompt after running nix-shell with the shell.nix above:

lucasew@whiterun ~/demo-hip-issue 0$ nix-shell
(shell:impure) lucasew@whiterun ~/demo-hip-issue 0$ python
Python 3.10.9 (main, Dec  6 2022, 18:44:57) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.tensor([[1,2],[3,4]]).to(torch.device('cuda')
... 
... )
Memory access fault by GPU node-1 (Agent handle: 0x7817470) on address 0x735d000. Reason: Unknown.
Aborted (core dumped)

Whiterun is running https://github.com/lucasew/nixcfg/tree/811c58b6b9c743fab692fb3fc7817ded83974b6c

And this is what I got in the dmesg right after I ran that Python snippet.

[  292.842655] gmc_v9_0_process_interrupt: 34 callbacks suppressed
[  292.842658] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:157 vmid:8 pasid:32771, for process python pid 5974 thread python pid 5974)
[  292.842662] amdgpu 0000:07:00.0: amdgpu:   in page starting at address 0x000000000735d000 from IH client 0x1b (UTCL2)
[  292.842670] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00801031
[  292.842670] amdgpu 0000:07:00.0: amdgpu: 	 Faulty UTCL2 client ID: TCP (0x8)
[  292.842671] amdgpu 0000:07:00.0: amdgpu: 	 MORE_FAULTS: 0x1
[  292.842672] amdgpu 0000:07:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[  292.842672] amdgpu 0000:07:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x3
[  292.842673] amdgpu 0000:07:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[  292.842673] amdgpu 0000:07:00.0: amdgpu: 	 RW: 0x0
[  292.842675] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:8 pasid:32771, for process python pid 5974 thread python pid 5974)
[  292.842677] amdgpu 0000:07:00.0: amdgpu:   in page starting at address 0x000000000735d000 from IH client 0x1b (UTCL2)
[  292.842683] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  292.842684] amdgpu 0000:07:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB (0x0)
[  292.842684] amdgpu 0000:07:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[  292.842685] amdgpu 0000:07:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[  292.842685] amdgpu 0000:07:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[  292.842686] amdgpu 0000:07:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[  292.842686] amdgpu 0000:07:00.0: amdgpu: 	 RW: 0x0
[  292.842687] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:8 pasid:32771, for process python pid 5974 thread python pid 5974)
[  292.842689] amdgpu 0000:07:00.0: amdgpu:   in page starting at address 0x000000000735d000 from IH client 0x1b (UTCL2)
[  292.842695] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  292.842695] amdgpu 0000:07:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB (0x0)
[  292.842696] amdgpu 0000:07:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[  292.842696] amdgpu 0000:07:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[  292.842697] amdgpu 0000:07:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[  292.842697] amdgpu 0000:07:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[  292.842698] amdgpu 0000:07:00.0: amdgpu: 	 RW: 0x0
[  292.842698] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:8 pasid:32771, for process python pid 5974 thread python pid 5974)
[  292.842699] amdgpu 0000:07:00.0: amdgpu:   in page starting at address 0x000000000735d000 from IH client 0x1b (UTCL2)
[  292.842705] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  292.842706] amdgpu 0000:07:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB (0x0)
[  292.842706] amdgpu 0000:07:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[  292.842707] amdgpu 0000:07:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[  292.842707] amdgpu 0000:07:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[  292.842708] amdgpu 0000:07:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[  292.842708] amdgpu 0000:07:00.0: amdgpu: 	 RW: 0x0
[  292.842709] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:8 pasid:32771, for process python pid 5974 thread python pid 5974)
[  292.842710] amdgpu 0000:07:00.0: amdgpu:   in page starting at address 0x000000000735d000 from IH client 0x1b (UTCL2)
[  292.842716] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  292.842716] amdgpu 0000:07:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB (0x0)
[  292.842717] amdgpu 0000:07:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[  292.842717] amdgpu 0000:07:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[  292.842718] amdgpu 0000:07:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[  292.842718] amdgpu 0000:07:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[  292.842719] amdgpu 0000:07:00.0: amdgpu: 	 RW: 0x0
[  292.842720] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:8 pasid:32771, for process python pid 5974 thread python pid 5974)
[  292.842721] amdgpu 0000:07:00.0: amdgpu:   in page starting at address 0x000000000735d000 from IH client 0x1b (UTCL2)
[  292.842726] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  292.842727] amdgpu 0000:07:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB (0x0)
[  292.842728] amdgpu 0000:07:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[  292.842728] amdgpu 0000:07:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[  292.842728] amdgpu 0000:07:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[  292.842729] amdgpu 0000:07:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[  292.842729] amdgpu 0000:07:00.0: amdgpu: 	 RW: 0x0
[  292.842730] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:8 pasid:32771, for process python pid 5974 thread python pid 5974)
[  292.842731] amdgpu 0000:07:00.0: amdgpu:   in page starting at address 0x000000000735d000 from IH client 0x1b (UTCL2)
[  292.842737] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  292.842738] amdgpu 0000:07:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB (0x0)
[  292.842738] amdgpu 0000:07:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[  292.842739] amdgpu 0000:07:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[  292.842739] amdgpu 0000:07:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[  292.842740] amdgpu 0000:07:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[  292.842740] amdgpu 0000:07:00.0: amdgpu: 	 RW: 0x0
[  292.842741] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:8 pasid:32771, for process python pid 5974 thread python pid 5974)
[  292.842742] amdgpu 0000:07:00.0: amdgpu:   in page starting at address 0x000000000735d000 from IH client 0x1b (UTCL2)
[  292.842745] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  292.842745] amdgpu 0000:07:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB (0x0)
[  292.842746] amdgpu 0000:07:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[  292.842746] amdgpu 0000:07:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[  292.842747] amdgpu 0000:07:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[  292.842747] amdgpu 0000:07:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[  292.842748] amdgpu 0000:07:00.0: amdgpu: 	 RW: 0x0
[  292.842750] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:8 pasid:32771, for process python pid 5974 thread python pid 5974)
[  292.842751] amdgpu 0000:07:00.0: amdgpu:   in page starting at address 0x000000000735d000 from IH client 0x1b (UTCL2)
[  292.842754] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  292.842754] amdgpu 0000:07:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB (0x0)
[  292.842755] amdgpu 0000:07:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[  292.842755] amdgpu 0000:07:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[  292.842756] amdgpu 0000:07:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[  292.842756] amdgpu 0000:07:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[  292.842757] amdgpu 0000:07:00.0: amdgpu: 	 RW: 0x0
[  292.842758] amdgpu 0000:07:00.0: amdgpu: [gfxhub0] retry page fault (src_id:0 ring:0 vmid:8 pasid:32771, for process python pid 5974 thread python pid 5974)
[  292.842759] amdgpu 0000:07:00.0: amdgpu:   in page starting at address 0x000000000735d000 from IH client 0x1b (UTCL2)
[  292.842762] amdgpu 0000:07:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00000000
[  292.842763] amdgpu 0000:07:00.0: amdgpu: 	 Faulty UTCL2 client ID: CB (0x0)
[  292.842763] amdgpu 0000:07:00.0: amdgpu: 	 MORE_FAULTS: 0x0
[  292.842764] amdgpu 0000:07:00.0: amdgpu: 	 WALKER_ERROR: 0x0
[  292.842764] amdgpu 0000:07:00.0: amdgpu: 	 PERMISSION_FAULTS: 0x0
[  292.842765] amdgpu 0000:07:00.0: amdgpu: 	 MAPPING_ERROR: 0x0
[  292.842765] amdgpu 0000:07:00.0: amdgpu: 	 RW: 0x0
[  294.367109] ------------[ cut here ]------------
[  294.367112] WARNING: CPU: 10 PID: 5999 at kernel/workqueue.c:3083 __flush_work.isra.0+0x21f/0x230
[  294.367118] Modules linked in: af_packet nf_conntrack_netlink xfrm_user xfrm_algo xt_addrtype br_netfilter overlay xt_CHECKSUM xt_MASQUERADE ipt_REJECT nf_reject_ipv4 nft_chain_nat nf_nat rfkill nls_iso8859_1 nls_cp437 vfat fat ip6_tables snd_hda_codec_realtek xt_conntrack nf_conntrack snd_hda_codec_generic nf_defrag_ipv6 ledtrig_audio led_class nf_defrag_ipv4 snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi ip6t_rpfilter intel_rapl_msr ipt_rpfilter snd_hda_codec edac_mce_amd edac_core wmi_bmof snd_hda_core intel_rapl_common xt_pkttype crc32_pclmul ghash_clmulni_intel evdev snd_hwdep mac_hid aesni_intel xt_LOG snd_pcm nf_log_syslog r8169 libaes crypto_simd cryptd xt_tcpudp sp5100_tco watchdog realtek nft_compat snd_timer rapl mdio_devres nft_counter snd k10temp i2c_piix4 libphy wmi soundcore video gpio_amdpt tiny_power_button gpio_generic pinctrl_amd button acpi_cpufreq nf_tables libcrc32c nfnetlink sch_fq_codel ctr atkbd libps2 serio loop veth bridge stp llc tun
[  294.367154]  vboxnetflt(O) vboxnetadp(O) vboxdrv(O) kvm_amd ccp rng_core kvm irqbypass fuse deflate efi_pstore pstore configfs efivarfs dmi_sysfs ip_tables x_tables autofs4 ext4 crc32c_generic crc16 mbcache jbd2 sd_mod xhci_pci xhci_pci_renesas xhci_hcd ahci libahci libata usbcore nvme crc32c_intel scsi_mod nvme_core t10_pi crc_t10dif crct10dif_generic crct10dif_pclmul crct10dif_common usb_common scsi_common rtc_cmos dm_mod amdgpu drm_ttm_helper ttm agpgart iommu_v2 gpu_sched i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm i2c_core backlight
[  294.367178] CPU: 10 PID: 5999 Comm: python Tainted: G           O      5.15.83 #1-NixOS
[  294.367180] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./B450M Steel Legend, BIOS P4.00 05/06/2021
[  294.367181] RIP: 0010:__flush_work.isra.0+0x21f/0x230
[  294.367183] Code: 8b 4d 00 4c 8b 45 08 89 ca 48 c1 e9 04 83 e2 08 83 e1 0f 83 ca 02 89 c8 48 0f ba 6d 00 03 e9 13 ff ff ff 0f 0b e9 45 ff ff ff <0f> 0b 45 31 ed e9 3b ff ff ff e8 e2 31 81 00 66 90 0f 1f 44 00 00
[  294.367184] RSP: 0018:ffffb6b381d9fb28 EFLAGS: 00010246
[  294.367186] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[  294.367186] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff953eb5a54718
[  294.367187] RBP: ffff953eb5a54718 R08: 0000000000000000 R09: ffffffff99650b50
[  294.367187] R10: 0000000000000000 R11: 0000000000000000 R12: ffff953eb5a54718
[  294.367188] R13: 0000000000000001 R14: 0000000000000003 R15: ffff953e98cb7bac
[  294.367189] FS:  0000000000000000(0000) GS:ffff95418e280000(0000) knlGS:0000000000000000
[  294.367190] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  294.367190] CR2: 00007fb0193dbff8 CR3: 00000001014b6000 CR4: 0000000000750ee0
[  294.367191] PKRU: 55555554
[  294.367192] Call Trace:
[  294.367194]  <TASK>
[  294.367196]  ? del_timer+0x55/0x80
[  294.367199]  __cancel_work_timer+0x11a/0x1b0
[  294.367201]  kfd_process_notifier_release+0x8b/0x160 [amdgpu]
[  294.367338]  __mmu_notifier_release+0x73/0x210
[  294.367342]  exit_mmap+0x1ad/0x1f0
[  294.367345]  ? delayacct_add_tsk+0x63/0x1b0
[  294.367347]  ? exit_robust_list+0x5c/0x140
[  294.367349]  ? __cond_resched+0x16/0x50
[  294.367351]  ? mutex_lock+0xe/0x30
[  294.367353]  mmput+0x5a/0x140
[  294.367356]  do_exit+0x2f0/0xa40
[  294.367357]  do_group_exit+0x33/0xa0
[  294.367358]  get_signal+0x14a/0x910
[  294.367360]  arch_do_signal_or_restart+0x101/0x730
[  294.367363]  ? do_send_sig_info+0x6b/0xc0
[  294.367364]  ? do_tkill+0x88/0xb0
[  294.367365]  exit_to_user_mode_prepare+0x10e/0x230
[  294.367367]  syscall_exit_to_user_mode+0x18/0x40
[  294.367369]  do_syscall_64+0x48/0x90
[  294.367371]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[  294.367373] RIP: 0033:0x7fb1e899cbc7
[  294.367389] Code: Unable to access opcode bytes at RIP 0x7fb1e899cb9d.
[  294.367389] RSP: 002b:00007fb0193deb30 EFLAGS: 00000246 ORIG_RAX: 00000000000000ea
[  294.367390] RAX: 0000000000000000 RBX: 000000000000176f RCX: 00007fb1e899cbc7
[  294.367391] RDX: 0000000000000006 RSI: 000000000000176f RDI: 0000000000001756
[  294.367392] RBP: 0000000001e90d08 R08: 00007fb0193df948 R09: 0000000000000020
[  294.367392] R10: 0000000000000008 R11: 0000000000000246 R12: 00007fb0193ded58
[  294.367393] R13: 0000000000000000 R14: 0000000000000006 R15: 0000000001e90d88
[  294.367394]  </TASK>
[  294.367394] ---[ end trace 511b8352d6af64c6 ]---
[  294.382835] ------------[ cut here ]------------
[  294.382836] WARNING: CPU: 10 PID: 1650 at drivers/gpu/drm/ttm/ttm_bo.c:409 ttm_bo_release+0x2db/0x300 [ttm]
[  294.382843] Modules linked in: af_packet nf_conntrack_netlink xfrm_user xfrm_algo xt_addrtype br_netfilter overlay xt_CHECKSUM xt_MASQUERADE ipt_REJECT nf_reject_ipv4 nft_chain_nat nf_nat rfkill nls_iso8859_1 nls_cp437 vfat fat ip6_tables snd_hda_codec_realtek xt_conntrack nf_conntrack snd_hda_codec_generic nf_defrag_ipv6 ledtrig_audio led_class nf_defrag_ipv4 snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi ip6t_rpfilter intel_rapl_msr ipt_rpfilter snd_hda_codec edac_mce_amd edac_core wmi_bmof snd_hda_core intel_rapl_common xt_pkttype crc32_pclmul ghash_clmulni_intel evdev snd_hwdep mac_hid aesni_intel xt_LOG snd_pcm nf_log_syslog r8169 libaes crypto_simd cryptd xt_tcpudp sp5100_tco watchdog realtek nft_compat snd_timer rapl mdio_devres nft_counter snd k10temp i2c_piix4 libphy wmi soundcore video gpio_amdpt tiny_power_button gpio_generic pinctrl_amd button acpi_cpufreq nf_tables libcrc32c nfnetlink sch_fq_codel ctr atkbd libps2 serio loop veth bridge stp llc tun
[  294.382866]  vboxnetflt(O) vboxnetadp(O) vboxdrv(O) kvm_amd ccp rng_core kvm irqbypass fuse deflate efi_pstore pstore configfs efivarfs dmi_sysfs ip_tables x_tables autofs4 ext4 crc32c_generic crc16 mbcache jbd2 sd_mod xhci_pci xhci_pci_renesas xhci_hcd ahci libahci libata usbcore nvme crc32c_intel scsi_mod nvme_core t10_pi crc_t10dif crct10dif_generic crct10dif_pclmul crct10dif_common usb_common scsi_common rtc_cmos dm_mod amdgpu drm_ttm_helper ttm agpgart iommu_v2 gpu_sched i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm i2c_core backlight
[  294.382882] CPU: 10 PID: 1650 Comm: kworker/10:3 Tainted: G        W  O      5.15.83 #1-NixOS
[  294.382883] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./B450M Steel Legend, BIOS P4.00 05/06/2021
[  294.382884] Workqueue: kfd_process_wq kfd_process_wq_release [amdgpu]
[  294.382993] RIP: 0010:ttm_bo_release+0x2db/0x300 [ttm]
[  294.382996] Code: e8 9a 46 2e d8 e9 bb fd ff ff 49 8b 7e 98 b9 30 75 00 00 31 d2 be 01 00 00 00 e8 a0 68 2e d8 49 8b 46 e8 eb 9e 48 89 e8 eb 99 <0f> 0b e9 46 fd ff ff e8 99 44 2e d8 e9 ed fe ff ff be 03 00 00 00
[  294.382997] RSP: 0018:ffffb6b381df7cb8 EFLAGS: 00010202
[  294.382998] RAX: 0000000000000001 RBX: ffffb6b381df7d00 RCX: 0000000080400035
[  294.382999] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff953eb5a531b8
[  294.382999] RBP: ffff953e8a285240 R08: ffff953eb5a531b8 R09: 0000000000000000
[  294.383000] R10: ffff953e9e038540 R11: 0000000000000000 R12: ffff953eaffb7e30
[  294.383000] R13: ffff953eb5a53058 R14: ffff953eb5a531b8 R15: dead000000000100
[  294.383001] FS:  0000000000000000(0000) GS:ffff95418e280000(0000) knlGS:0000000000000000
[  294.383002] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  294.383002] CR2: 00007fb0193dbff8 CR3: 000000004be10000 CR4: 0000000000750ee0
[  294.383003] PKRU: 55555554
[  294.383003] Call Trace:
[  294.383005]  <TASK>
[  294.383006]  amdgpu_bo_unref+0x1a/0x30 [amdgpu]
[  294.383071]  amdgpu_gem_object_free+0x30/0x50 [amdgpu]
[  294.383135]  amdgpu_amdkfd_gpuvm_free_memory_of_gpu+0x34f/0x3c0 [amdgpu]
[  294.383211]  kfd_process_device_free_bos+0x9d/0xe0 [amdgpu]
[  294.383281]  kfd_process_wq_release+0x20d/0x2d0 [amdgpu]
[  294.383348]  process_one_work+0x1f1/0x390
[  294.383351]  worker_thread+0x53/0x3e0
[  294.383352]  ? process_one_work+0x390/0x390
[  294.383353]  kthread+0x127/0x150
[  294.383354]  ? set_kthread_struct+0x50/0x50
[  294.383355]  ret_from_fork+0x22/0x30
[  294.383357]  </TASK>
[  294.383358] ---[ end trace 511b8352d6af64c7 ]---

And this is your script output:

(shell:impure) lucasew@whiterun ~/demo-hip-issue 0$ python test-pytorch
CUDA support: True (Should be "True")
CUDA version: None (Should be "None")
HIP version: 5.4.22802-0 (Should contain "5.4")
Current CUDA device ID: 0
Current CUDA device name: AMD Radeon Graphics (Should be AMD, not NVIDIA)
Segmentation fault (core dumped)

@Madouura
Contributor

So it's not torch itself, the commit, or nixpkgs; everything matches up as far as torch goes.
Honestly, I would suggest taking this up with AMD. Considering all the errors I've seen, the closest match I can think of is https://github.com/RadeonOpenCompute/ROCm-Device-Libs.
You're still using the machine with the 5600G, right?

@Madouura
Contributor

Madouura commented Dec 28, 2022

I do have my user in the "video" and "render" groups, just in case that solves your issue, but I doubt it.
https://www.gabriel.urdhr.fr/2022/08/28/trying-to-run-stable-diffusion-on-amd-ryzen-5-5600g also suggests adding your user to "render".

@lucasew
Contributor Author

lucasew commented Dec 28, 2022

At that time my user wasn't in the video and render groups, so I added it. Same problem.

And yes, it's the 5600G on a B450 board, less than a year old, and I got it working with Blender.

BTW those segfaults are hell to debug.

Screenshot_2022-12-28_07-49-50

@lucasew
Contributor Author

lucasew commented Dec 28, 2022

I think I got something \o/

(shell:impure) lucasew@whiterun ~/demo-hip-issue 139$ HSA_OVERRIDE_GFX_VERSION=9.0.0 ./test-pytorch 
CUDA support: True (Should be "True")
CUDA version: None (Should be "None")
HIP version: 5.4.22802-0 (Should contain "5.4")
Current CUDA device ID: 0
Current CUDA device name: AMD Radeon Graphics (Should be AMD, not NVIDIA)
mul_sum(x, x):  131.0 us
mul_sum(x, x):    9.2 us
bmm(x, x):      330.2 us
bmm(x, x):       18.9 us
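
For reference, the override value is a `major.minor.stepping` triple. The sketch below derives it from a target name such as `gfx1030`. It assumes the usual LLVM AMDGPU naming convention, where the last two characters are the minor version and the stepping as single hex digits. The helper name `gfx_to_override` is made up for illustration and is not part of ROCm. Under that assumption, `10.3.0` corresponds to `gfx1030` (RDNA2) and `9.0.0` to `gfx900` (Vega), which would fit the 5600G's Vega-based iGPU if it reports as `gfx90c`, as commonly described.

```python
import re


def gfx_to_override(gfx_name: str) -> str:
    """Turn a target name like 'gfx1030' into 'major.minor.stepping'.

    Assumption: the last two characters are hex digits for the minor
    version and the stepping; everything between 'gfx' and those two
    characters is the decimal major version.
    """
    match = re.fullmatch(r"gfx([0-9]+)([0-9a-f])([0-9a-f])", gfx_name)
    if match is None:
        raise ValueError(f"unrecognized target name: {gfx_name!r}")
    major, minor, stepping = match.groups()
    return f"{int(major)}.{int(minor, 16)}.{int(stepping, 16)}"


if __name__ == "__main__":
    for name in ("gfx900", "gfx90c", "gfx1030"):
        print(name, "->", gfx_to_override(name))
```

This only explains the numbering; whether a given override actually works depends on which code objects the installed ROCm libraries ship.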

@Madouura
Contributor

Ahh so it was HSA_OVERRIDE_GFX_VERSION=9.0.0 and maybe the render group after all.

@Madouura
Contributor

Try both of those (and video, for good measure) with the docker image; theoretically it should work.

@lucasew
Contributor Author

lucasew commented Dec 28, 2022

Tried to replicate with a fresh reboot.

Same result.

We got it 🥂

For the record, whiterun is running lucasew/nixcfg@d98b0e2, and I added the group definitions in the bootstrap node so they propagate to all the others.
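
A quick way to confirm the group change took effect for the current session is to compare the process's groups against the two names discussed in this thread. This is only a sketch: the helper names are made up for illustration, and it assumes a Unix system where the groups are literally called `video` and `render`.

```python
import grp
import os

# Group names taken from this thread; other distros may differ.
REQUIRED_GROUPS = ("video", "render")


def missing_groups(current_groups, required=REQUIRED_GROUPS):
    """Return the required group names absent from current_groups, sorted."""
    return sorted(set(required) - set(current_groups))


def current_group_names():
    """Names of the groups the running process belongs to."""
    names = []
    for gid in set(os.getgroups()) | {os.getegid()}:
        try:
            names.append(grp.getgrgid(gid).gr_name)
        except KeyError:
            pass  # GID has no entry in the group database
    return names


if __name__ == "__main__":
    missing = missing_groups(current_group_names())
    if missing:
        print("not a member of:", ", ".join(missing))
    else:
        print("member of video and render")
```

Group changes usually apply only to new login sessions, so a check like this is most useful after logging in again or rebooting.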

Screenshot_2022-12-28_07-58-32

@Madouura
Contributor

Glad we got it working!
Gonna close since this isn't a nixpkgs issue, but if there's anything else I can help with, let me know.

@lucasew
Contributor Author

lucasew commented Dec 28, 2022

Well, the issue is actually about the official containers. These are still not working.

Screenshot_2022-12-28_08-09-23

@lucasew reopened this Dec 28, 2022
@Madouura
Contributor

The official docker containers, right? That's not nixpkgs-related.
I'm not sure why those wouldn't be working.
Maybe docker itself needs to be added to video and render in your nix config?

@Madouura
Contributor

...unless this is related to your nix shell issue, but I don't see how that could be...

@Madouura
Contributor

Madouura commented Dec 28, 2022

You could also try adding --ipc=host to your docker arguments.
See: https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Install-and-Run-on-AMD-GPUs

@lucasew
Contributor Author

lucasew commented Dec 28, 2022

The example that worked is based on nix-shell, not nix shell.

--ipc=host is already there.

The full docker run command is docker run -it --network=host --device=/dev/kfd --device=/dev/dri --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined 67a4

67a4 is a container generated from rocm/pytorch, but with the user added to the render group.

BTW, that torch.tensor([[1,2],[3,4]]).to(torch.device('cuda')) is still failing.
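
To make the variations easier to compare, the command above can be assembled programmatically. The sketch below builds the same `docker run` argument list and optionally adds the GFX override as an environment variable plus extra `--group-add` entries. The function name and its parameters are made up for illustration. Note that passing a group by name only works if that name exists inside the image; otherwise a numeric GID would be needed.

```python
import shlex


def rocm_docker_command(image, gfx_override=None, groups=("video",)):
    """Build a docker run argument list mirroring the command in this thread."""
    cmd = [
        "docker", "run", "-it",
        "--network=host",
        "--device=/dev/kfd",   # ROCm compute interface
        "--device=/dev/dri",   # render nodes
        "--ipc=host",
        "--cap-add=SYS_PTRACE",
        "--security-opt", "seccomp=unconfined",
    ]
    for group in groups:
        cmd.append(f"--group-add={group}")
    if gfx_override is not None:
        # Forward the override into the container instead of exporting it inside.
        cmd += ["-e", f"HSA_OVERRIDE_GFX_VERSION={gfx_override}"]
    cmd.append(image)
    return cmd


if __name__ == "__main__":
    args = rocm_docker_command("rocm/pytorch:latest", "9.0.0", ("video", "render"))
    print(shlex.join(args))
```

The printed line can be pasted into a shell, or the list can be handed to `subprocess.run` directly.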

@Madouura
Contributor

Ugh, reading comprehension again...
Anyway, I've gotten the stable diffusion (webui) docker container working, so I'm not sure why the pytorch one isn't.
I'm afraid I'm out of ideas as far as docker goes. I still don't think this is a nixpkgs issue, but in case it's an issue with docker...
cc (docker maintainers) @offlinehacker @tailhook @vdemeester @periklis @mikroskeem @maxeaubrey

@Madouura
Contributor

> BTW, that torch.tensor([[1,2],[3,4]]).to(torch.device('cuda')) is still failing.

With torchWithRocm, right? Works for me?
Screenshot from 2022-12-28 05-26-01

@mikroskeem
Member

I don't think Docker is getting in the way much here anymore, because the right device nodes appear to be bound from the host, and the stock seccomp profile, which could block syscalls, is disabled as well (seccomp=unconfined). The docker run configuration follows what the upstream wiki says, so unless the wiki is out of date, it should work exactly the same.

Have you looked into the stable-diffusion-webui issues about those segfaults? Maybe these give a few pointers:

  • https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/6032
  • https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/4870

I'm afraid I'm not able to give much insight into this myself, as I don't have a CUDA/ROCm-capable GPU (...unless Steam Deck APU?).

@lucasew
Contributor Author

lucasew commented Dec 28, 2022

> I'm afraid I'm not able to give much insight into this myself, as I don't have a CUDA/ROCm-capable GPU (...unless Steam Deck APU?).

Well, isn't the Steam Deck GPU basically an RDNA2 GPU? That should work.

@Madouura what's your hardware, and where do you define the GPU stuff in your config? I may have made mistakes in my config. But yeah, it's based on that staging commit.

@Madouura
Contributor

> > I'm afraid I'm not able to give much insight into this myself, as I don't have a CUDA/ROCm-capable GPU (...unless Steam Deck APU?).
>
> Well, isn't the Steam Deck GPU basically an RDNA2 GPU? That should work.
>
> @Madouura what's your hardware, and where do you define the GPU stuff in your config? I may have made mistakes in my config. But yeah, it's based on that staging commit.

Hopefully this is enough. One is a 6900 XT, the other is a 6800.
Screenshot from 2022-12-28 06-05-40
These should be relevant:

@Madouura
Contributor

Madouura commented Dec 28, 2022

Wait a minute... The likely reason our torch works and the official docker image doesn't is this:

patches = [
  # Enable support for gfx8 again
  # See the upstream issue: https://github.com/RadeonOpenCompute/ROCm/issues/1659
  # And the arch patch: https://github.com/rocm-arch/rocm-arch/pull/742
  (fetchpatch {
    url = "https://raw.githubusercontent.com/John-Gee/rocm-arch/d6812d308fee3caf2b6bb01b4d19fe03a6a0e3bd/rocm-opencl-runtime/enable-gfx800.patch";
    hash = "sha256-59jFDIIsTTZcNns9RyMVWPRUggn/bSlAGrky4quu8B4=";
  })
];

IIRC shouldn't the 5600g be gfx8? If so, that's definitely why.
The official docker image isn't an option for you.

@lucasew
Contributor Author

lucasew commented Dec 28, 2022

I just updated my kernel to linuxPackages_6_0. I was using the default (5.15).

It seems that the stuff is working now, even the container.

Screenshot_20221228-105429
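
For anyone hitting the same symptoms, a small check of the running kernel against the version that worked here may save some time. This is a sketch only: the 6.0 threshold is simply what this thread observed (5.15 failing, 6.0 working), not a documented minimum, and the helper names are made up for illustration.

```python
import platform
import re


def kernel_version(release: str) -> tuple:
    """Extract (major, minor) from a release string such as '5.15.83'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"cannot parse kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def at_least(release: str, minimum=(6, 0)) -> bool:
    """True if the release is at or above the minimum (major, minor)."""
    return kernel_version(release) >= minimum


if __name__ == "__main__":
    release = platform.release()
    try:
        verdict = "6.0 or newer" if at_least(release) else "older than 6.0"
    except ValueError:
        verdict = "unrecognized release format"
    print(f"{release}: {verdict}")
```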

@mikroskeem
Member

I suppose this issue can be closed now?

@lucasew
Contributor Author

lucasew commented Dec 28, 2022

I just want to test TensorFlow first. But if the ROCm layer is known to be working, then I suppose there is no more work for you to do in this issue. Thank you guys. You are awesome.

@Madouura
Contributor

Looks like there were some AMD changes in 6.0, go figure.
Glad we could help.
