This repository has been archived by the owner on Sep 27, 2023. It is now read-only.

problem with AMD Gallium/Mesa (open source) GPU driver on Linux #30

Closed
gsohler opened this issue Jul 26, 2018 · 19 comments

@gsohler

gsohler commented Jul 26, 2018

I tried to create a nice wire-model cube like this, but it only renders at 1 FPS.
This is way slower than originator.
How do I write fast Curv code?

union[
  for (i in 0..1)
    for (j in 0..1)
      (
        capsule{ d:0.3, from:[i,j,0], to:[i,j,1] } ;
        capsule{ d:0.3, from:[i,0,j], to:[i,1,j] } ;
        capsule{ d:0.3, from:[0,i,j], to:[1,i,j] }
      )
]
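
For readers unfamiliar with Curv, the model above is a union of capsules, one per edge of the unit cube. The following is a rough Python sketch (not Curv code) that lists the endpoints, assuming the comprehension yields three capsules for each (i, j) pair. The function name `cube_edges` is made up for illustration.

```python
from itertools import product

def cube_edges():
    """Return (from, to) endpoint pairs, mirroring the nested loops above."""
    edges = []
    for i, j in product((0, 1), repeat=2):
        # Same three capsules per (i, j) pair as in the Curv snippet.
        edges.append(((i, j, 0), (i, j, 1)))  # edge along z
        edges.append(((i, 0, j), (i, 1, j)))  # edge along y
        edges.append(((0, i, j), (1, i, j)))  # edge along x
    return edges

edges = cube_edges()
print(len(edges))            # number of capsules in the union
print(len(set(edges)))       # number of distinct edges
```

Under that assumption the union contains 12 distinct capsules, so the generated shader has to evaluate 12 distance functions per sample.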

@doug-moen
Member

It shouldn't be this slow.

For performance, the Curv 3D viewing window depends on access to a hardware GPU made by Intel, AMD or Nvidia, using the vendor-supplied GPU driver. Maybe you are running Curv inside a VM, and the GPU driver is simulating a GPU in software?

@gsohler
Author

gsohler commented Jul 27, 2018

Doug, you are right.
I was running curv inside vncviewer (by chance). There it runs at 1 FPS.
However, when run directly, it is not even displaying anything.
[screenshot: selection_100]

I am really interested in getting curv working with good performance.

/proc/cpuinfo yields the output below.

What is wrong with my setup?

processor : 0
vendor_id : AuthenticAMD
cpu family : 21
model : 2
model name : AMD FX(tm)-4300 Quad-Core Processor
stepping : 0
microcode : 0x600084f
cpu MHz : 1800.000
cache size : 2048 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 2
apicid : 16
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 popcnt aes xsave avx f16c lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit wdt lwp fma4 tce nodeid_msr tbm topoext perfctr_core perfctr_nb cpb hw_pstate vmmcall bmi1 arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold
bugs : fxsave_leak sysret_ss_attrs null_seg
bogomips : 7634.90
TLB size : 1536 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm 100mhzsteps hwpstate cpb eff_freq_ro

processor : 1
vendor_id : AuthenticAMD
cpu family : 21
model : 2
model name : AMD FX(tm)-4300 Quad-Core Processor
stepping : 0
microcode : 0x600084f
cpu MHz : 1400.000
cache size : 2048 KB
physical id : 0
siblings : 4
core id : 1
cpu cores : 2
apicid : 17
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 popcnt aes xsave avx f16c lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit wdt lwp fma4 tce nodeid_msr tbm topoext perfctr_core perfctr_nb cpb hw_pstate vmmcall bmi1 arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold
bugs : fxsave_leak sysret_ss_attrs null_seg
bogomips : 7633.47
TLB size : 1536 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm 100mhzsteps hwpstate cpb eff_freq_ro

processor : 2
vendor_id : AuthenticAMD
cpu family : 21
model : 2
model name : AMD FX(tm)-4300 Quad-Core Processor
stepping : 0
microcode : 0x600084f
cpu MHz : 1800.000
cache size : 2048 KB
physical id : 0
siblings : 4
core id : 2
cpu cores : 2
apicid : 18
initial apicid : 2
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 popcnt aes xsave avx f16c lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit wdt lwp fma4 tce nodeid_msr tbm topoext perfctr_core perfctr_nb cpb hw_pstate vmmcall bmi1 arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold
bugs : fxsave_leak sysret_ss_attrs null_seg
bogomips : 7633.49
TLB size : 1536 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm 100mhzsteps hwpstate cpb eff_freq_ro

processor : 3
vendor_id : AuthenticAMD
cpu family : 21
model : 2
model name : AMD FX(tm)-4300 Quad-Core Processor
stepping : 0
microcode : 0x600084f
cpu MHz : 1800.000
cache size : 2048 KB
physical id : 0
siblings : 4
core id : 3
cpu cores : 2
apicid : 19
initial apicid : 3
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 popcnt aes xsave avx f16c lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit wdt lwp fma4 tce nodeid_msr tbm topoext perfctr_core perfctr_nb cpb hw_pstate vmmcall bmi1 arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold
bugs : fxsave_leak sysret_ss_attrs null_seg
bogomips : 7633.47
TLB size : 1536 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm 100mhzsteps hwpstate cpb eff_freq_ro

@doug-moen
Member

I assume you are running Linux inside a VM on Windows. You need to enable GPU acceleration in the VM. I assume that any GPU intensive 3D software will have the same problem that Curv is having, in this Linux-on-VM environment, and that this problem is not Curv specific.

VncViewer is just a way to view the video display on a remote machine over a network connection. I guess you'd use it for communicating with programs running inside a VM. I don't expect it to affect the frame rate reported by Curv, but I expect it would add latency and make animations look jumpy.

@doug-moen
Member

In the latest version of Curv, curv --version now prints some debug information about the GPU. It shows the GPU model and driver information, as seen by Curv. It is intended to be useful in debugging GPU issues, and should provide some insight into what the GPU looks like from inside the VM.

@gsohler
Author

gsohler commented Jul 28, 2018

Doug, I don't run in a VM. I run on native Linux when I test it.
Did you notice the errors shown in the picture above when running curv without VNC?
It says:

EE r600_shader.c:182 r600_pipe_shader_create - translation from TGSI failed !
EE r600_state_common.c:798 r600_shader_select - Failed to build shader variant (type=1) -1

The new curv --version now outputs:

Curv: 0.1-226-g12dcafe
GPU: X.Org Gallium 0.4 on AMD RS780 (DRM 2.49.0 / 4.11.12-100.fc24.x86_64, LLVM 3.8.0)
OpenGL: 3.0 Mesa 12.0.3

Cheers Guenther
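
The `curv --version` report above is a short list of `Key: value` lines, so it is easy to pick apart when comparing setups. A minimal Python sketch, assuming the output always has that shape (the helper name `parse_curv_version` is made up for illustration and is not part of Curv):

```python
def parse_curv_version(text):
    """Split 'Key: value' lines into a dict; lines without a colon are skipped."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info

# Sample copied from the report in this thread.
sample = """\
Curv: 0.1-226-g12dcafe
GPU: X.Org Gallium 0.4 on AMD RS780 (DRM 2.49.0 / 4.11.12-100.fc24.x86_64, LLVM 3.8.0)
OpenGL: 3.0 Mesa 12.0.3
"""

info = parse_curv_version(sample)
print(info["GPU"])
print(info["OpenGL"])
```

The presence of "Mesa" in the `OpenGL` field is what identifies the open source driver stack here.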

@doug-moen
Member

Thanks for the extra information. I'm playing with VNC now, but getting Curv working acceptably over VNC looks hard. I recommend using it in the "normal" way, using a display that is plugged directly into the GPU.

You are using the open source Mesa/Gallium driver for your GPU, at least in the non-VNC case where it is failing. I checked the error message, and there's a bug open on Mesa for this at https://bugs.freedesktop.org/show_bug.cgi?id=99349. As I understand the bug, it is triggered when the GLSL code that I generate uses too many registers. A workaround in Curv is months of work, since it requires rewriting a significant part of the compiler: I need to write an optimizing compiler to limit GPU register use.

The VNC code path bypasses this bug, so maybe it's using a slow pure-software renderer instead of using the GPU? That would explain the low frame rate.
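
One way to test the software-renderer guess is to look at the renderer string reported inside the VNC session. Mesa's software rasterizers identify themselves with names such as `llvmpipe`, `softpipe` or `swrast`. The sketch below is a heuristic under that assumption. The helper name is hypothetical, and whether the VNC session in this thread actually used one of these renderers was never confirmed.

```python
# Substrings that Mesa software rasterizers commonly put in the renderer string.
SOFTWARE_RENDERER_HINTS = ("llvmpipe", "softpipe", "swrast", "software rasterizer")

def looks_like_software_renderer(renderer):
    """Heuristic: True if the renderer string names a known software rasterizer."""
    lowered = renderer.lower()
    return any(hint in lowered for hint in SOFTWARE_RENDERER_HINTS)

# The hardware renderer string reported in this thread is not flagged.
print(looks_like_software_renderer(
    "X.Org Gallium 0.4 on AMD RS780 (DRM 2.49.0 / 4.11.12-100.fc24.x86_64, LLVM 3.8.0)"
))
# An illustrative llvmpipe-style string (not taken from this thread) is flagged.
print(looks_like_software_renderer("llvmpipe (LLVM 6.0.0, 128 bits)"))
```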

My solution is to use the vendor supplied GPU driver, from AMD. Don't use Mesa.
https://support.amd.com/en-us/kb-articles/Pages/AMDGPU-PRO-Driver-for-Linux-Release-Notes.aspx Maybe this driver is already packaged by your distro.

@gsohler
Author

gsohler commented Jul 30, 2018

Hi Doug,
Thank you for your explanations.
I tried to follow your recommendation by installing a native AMDGPU-PRO driver.
Looking at the available packages for my Fedora 24, I realized that there is an
xorg-x11-drv-amdgpu package available, which I immediately installed.
However, it did not work, because, as I then realized, it is not the PRO driver.
Your link to the driver looks promising, but Fedora is not among the distributions the driver is offered for. I also read in some newsgroup posts (from 2017) that AMD did not yet offer a working Fedora driver; maybe this is why.
Unless I can find the source code of such a driver, I don't know how to continue right now.
Maybe I have to stay with the VNC solution until I find a better way.
Thank you so far ...
PS: Glad I found out by accident that curv is working over VNC.

@doug-moen
Member

I googled this, and yeah, installing AMDGPU-PRO on Fedora-24 is way too much work. You have to downgrade your kernel and X11, it seems.

@doug-moen
Member

The Mesa bug I linked to was fixed last year in Mesa 17.3. You reported running Mesa 12.0.3, which must be quite old. After installing xorg-x11-drv-amdgpu, try running curv --version again and tell me what the output is.

Fedora-24 is very old (June 2016) and reached end-of-life last year. If you can't run the latest AMDGPU-PRO, then running the latest Mesa might help. There is a xorg-x11-drv-amdgpu-18.0.1-1 package for Fedora 28, looks like it contains Mesa 18.0.1.
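
To check an installation against the fix mentioned above (Mesa 17.3), the Mesa version can be pulled out of the `OpenGL:` line of `curv --version` and compared as a tuple. This is a small Python sketch, not something Curv provides. The names `mesa_version` and `FIXED_IN` are made up for illustration.

```python
import re

# First Mesa release said in this thread to contain the fix.
FIXED_IN = (17, 3, 0)

def mesa_version(opengl_line):
    """Extract (major, minor, patch) from e.g. '3.0 Mesa 12.0.3', or None."""
    match = re.search(r"Mesa (\d+)\.(\d+)(?:\.(\d+))?", opengl_line)
    if match is None:
        return None
    # A missing patch component is treated as 0.
    return tuple(int(part) if part is not None else 0 for part in match.groups())

version = mesa_version("3.0 Mesa 12.0.3")
print(version, version >= FIXED_IN)
```

A version at or above 17.3 only rules out that particular bug. It does not guarantee that a given shader compiles.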

@gsohler
Author

gsohler commented Jul 30, 2018

Hi Doug,

Thank you for your valuable input.
Downgrading my X11 and my kernel is too risky for me, and not worth it just to improve my curv performance.
I think I will upgrade to the latest Fedora 28 instead.

Cheers Günther

@gsohler
Author

gsohler commented Aug 1, 2018

After downloading dozens of gigabytes from the internet, I finally arrived at
Fedora release 28 (Twenty Eight).
This is what /etc/fedora-release tells me.
Then I made sure that I have the
xorg-x11-drv-amdgpu.x86_64
package installed. However, there is no PRO package.
Having that, I recompiled it from scratch, and now curv tells me:
curv --version
Curv: 0.1-226-g12dcafe
GPU: X.Org AMD RS780 (DRM 2.50.0 / 4.17.9-200.fc28.x86_64, LLVM 6.0.0)
OpenGL: 3.0 Mesa 18.0.5

With that, curv still only works with VNC, at 1 FPS.
Without VNC, I still get the error:

EE r600_shader.c:3933 r600_shader_from_tgsi - GPR limit exceeded - shader requires 133 registers
EE r600_shader.c:183 r600_pipe_shader_create - translation from TGSI failed !
EE r600_state_common.c:872 r600_shader_select - Failed to build shader variant (type=1) -12

It appears that the number of registers used depends on the design. When I reduce my design, it works locally, displaying "6FPS". However, this cannot be true. Trying to be objective, it is still less than 1 FPS, and the mouse does not even move smoothly anymore while the program is running.

What could be wrong on my side?

@doug-moen
Member

A few initial comments.

  1. It's a new error message. "GPR limit exceeded - shader requires 133 registers" wasn't being shown before.
    This has been reported as a Mesa bug: https://bugs.freedesktop.org/show_bug.cgi?id=105371
    It's a bug in the register spilling code in the GLSL compiler.
    A fix for this bug was submitted in February (see link), but I guess you would need to download the
    source and build your own version of Mesa.
  2. I'm running Ubuntu 16.04 LTS, with an Nvidia GTX-1050 GPU, using the Nvidia proprietary driver.
    It's an entry level, budget GPU card, but I get 60 FPS for your model.
    When Curv uses too many registers, rendering slows down, but it hasn't failed
    to compile with this driver. (Register spilling works correctly in the Nvidia driver.)
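
When comparing models, it can help to pull the register count out of the r600 driver log rather than read it by eye. A minimal Python sketch, assuming the message keeps the exact wording quoted in this thread (the helper name `required_registers` is hypothetical):

```python
import re

# Matches the r600 message quoted in this thread.
GPR_RE = re.compile(r"GPR limit exceeded - shader requires (\d+) registers")

def required_registers(log_text):
    """Return the register count from the first matching log line, or None."""
    match = GPR_RE.search(log_text)
    return int(match.group(1)) if match else None

log = """\
EE r600_shader.c:3933 r600_shader_from_tgsi - GPR limit exceeded - shader requires 133 registers
EE r600_shader.c:183 r600_pipe_shader_create - translation from TGSI failed !
"""
print(required_registers(log))
```

Running this against the logs of a full and a reduced model gives a rough measure of how much a design change lowers register pressure.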

I read up on the difference between the Mesa and AMDGPU-PRO drivers.

  • AMDGPU-PRO is an old, mature driver that has been hacked for maximum compatibility with all sorts of 3D software, including a lot of desktop CAD programs. Due to all of the compatibility hacks, the code is dirty, and not fun to maintain. It's closed source.
  • Mesa is open source; it's new technology, cleaner code, more standards compliant.
    There is hope that in the future, Mesa will be able to replace the older AMDGPU-PRO driver.
    But Mesa is not mature enough yet.

You are running Fedora, which is a "bleeding edge" distro. The advantage is you get to use newer versions of packages. The disadvantage is a higher risk of things not working, and having to deal with that when things break. For example, you have a bad GPU driver. I personally run Ubuntu LTS, which has a strong emphasis on making everything work, with the disadvantage that packages are old. I either live with the older packages, or "side-load" up to date software and install it in /usr/local.

I think you have four options:

  • Download the Mesa source code, apply the bug fix, build the binary and install the driver.
  • Switch to a different distro that supports the AMDGPU-PRO driver.
    That means Red Hat, Suse or Ubuntu LTS.
    The latest version is 18.20, released in June.
  • Switch to an Nvidia GPU with the proprietary driver.
    Instructions here: https://www.if-not-true-then-false.com/2015/fedora-nvidia-guide/
  • Wait for me to rewrite the Curv compiler to reduce GPU register use. It's a 3 month job, though,
    and I've been deferring it because it's more fun to add new features.

@doug-moen changed the title from "Curv performance" to "problem with AMD Gallium/Mesa (open source) GPU driver on Linux" on Aug 1, 2018
@doug-moen
Member

I can't guarantee that the patch for Mesa bug #105371 will fix all of the problems that break Curv.

I looked at the bug fix, and it is not AMD specific. In principle, I should be able to reproduce the problem on my machine using the Nvidia Nouveau driver, which also uses Mesa. (In practice, the Nouveau driver doesn't work, I just get a black screen. I need to upgrade my Nouveau driver before I can make progress on reproduction...)

My ideal solution is for the Mesa project to fix all of the bugs that break Curv. I'm going to see if I can figure out how to install the latest GPU driver from mesa3d.org, without relying on Ubuntu repositories. Then maybe I can start filing bug reports against Mesa.

@gsohler
Author

gsohler commented Aug 2, 2018

Hi Doug!
First of all, I need to tell you that I'm impressed by your efforts.
Apparently I am not as fast at trying things as you are at suggesting them.

Following your previous posts, I decided to patch and recompile the Mesa driver.
I could do this with a 'git clone' and a 'git am', with the patches in mbox file format.

I could not yet find out where curv loads the Mesa driver from. First I installed it in /usr/local;
now I am trying with /usr.

... still trying ...

@gsohler
Author

gsohler commented Aug 2, 2018

It appears you have temporarily changed your focus and are looking into the driver issue rather than implementing new features, which is more attractive.
If you feel it's useful to test with my hardware, there might be options. In case you are interested, contact me at mail (at ) guenther-sohler.net

I just managed to get curv to use the compiled Mesa.

curv --version now shows:
Curv: 0.1-226-g12dcafe
GPU: X.Org AMD RS780 (DRM 2.50.0 / 4.17.9-200.fc28.x86_64, LLVM 5.0.1)
OpenGL: 3.0 Mesa 18.2.0-devel (git-a18be3dbc1)

When I try running curv with my capsule wireframe icosahedron, it outputs:

EE r600_shader.c:183 r600_pipe_shader_create - translation from TGSI failed !
EE r600_state_common.c:875 r600_shader_select - Failed to build shader variant (type=1) -12

Now there is no error about too many registers, but the mouse cursor becomes quite unresponsive and there is nothing to see ...

@doug-moen added the GPU label Aug 7, 2018
@gsohler
Author

gsohler commented Aug 20, 2018

This weekend I had time to install my Nvidia GTX 1050 graphics card into my computer, and connecting my display to the new card ultimately displays something useful.
The next step is to run the Nvidia installer on my Linux box. It compiles some of their modules against
my Linux kernel, which apparently fails. I need to look into how to proceed.

Guenther

@doug-moen
Member

doug-moen commented Aug 20, 2018 via email

@gsohler
Author

gsohler commented Sep 14, 2018

I can't comment on this anymore, as I have a good Nvidia card now.
Thus I can close :)

@gsohler closed this as completed Sep 14, 2018
@doug-moen
Member

This problem seems to be resolved. Curv 0.4 works with the AMD Mesa 19.0.2 driver, according to a report from @ivocavalcante. Here's the relevant output from curv --version for the version that works:

Curv: 0.4
Compiler: gcc 7.4.0
Kernel: Linux 5.0.0-23-generic x86_64
GPU: X.Org, AMD VERDE (DRM 2.50.0, 5.0.0-23-generic, LLVM 8.0.0)
OpenGL: 4.5 (Compatibility Profile) Mesa 19.0.2

This bug was originally reported for Mesa 18.x.
