Possible incompatibility between Freedesktop's Mesa libraries and Mesa libraries or kernel driver on the host #3673

Open
scx opened this issue Jun 10, 2020 · 21 comments


@scx

scx commented Jun 10, 2020

As you know, for NVIDIA hardware flatpak makes sure that it uses exactly the same driver version as the host. For example, if you have NVIDIA 440.64 on the host, it will install the 440.64 driver as a runtime extension. This is because the NVIDIA driver requires the userspace libraries and the kernel driver to be exactly the same version.
However, when it comes to Mesa drivers, the Freedesktop runtime tries to provide a relatively fresh version of the libraries, without looking at what is on the host.
For example, Freedesktop 18.08 uses Mesa 19.1.7 and Freedesktop 19.08 uses Mesa 20.0.5, while your operating system may use a completely different version, e.g. Mesa 18.3.4. The same applies to the kernel driver. In mid 2019, I was assured that this should never be an issue. However, I strongly believe that this may be a source of problems. I suspect that Mesa 20.0.5 from the Freedesktop 19.08 runtime may be somehow incompatible with the old Mesa libraries or kernel driver on the host, at least when it comes to OpenGL 4.6 support.
Actually, I hit this issue with Widelands on RHEL 7 with an Intel GPU (HD Graphics 630). After about 1 minute (61 seconds), the application terminates without any warning message.

This problem does not occur when:

  • we use an NVIDIA GPU with the NVIDIA 440.64 driver (Freedesktop 19.08, EL7).
  • we use the Freedesktop 18.08 runtime with Mesa 19.1.7 (Intel GPU, EL7).
  • we use a native package (RPM) with Mesa 18.3.4 (Intel GPU, EL7).
  • we use a native package (RPM) with NVIDIA 440.64 (NVIDIA GPU, EL7).
  • we use a more recent distribution (e.g. Ubuntu 20.04 LTS) and the Freedesktop 19.08 runtime with Mesa 20.0.5.
  • we use a more recent distribution (e.g. Ubuntu 20.04 LTS) and a native package (DEB) with Mesa 20.0.4.

More details:
widelands/widelands#3937

@alexlarsson
Member

@nwnk, what's your opinion here? Can we do better than just assuming that whatever Mesa version should work with whatever kernel driver version? For NVIDIA we do extract the kernel driver version and only use the exactly matching userland driver.

@nanonyme
Contributor

nanonyme commented Jun 15, 2020

It would be interesting to have another test where you have the EL7-era userspace but a newer kernel. New Mesa might well tickle some (at that point) less tested and buggy (but since fixed) codepaths in the older kernel.

@nwnk

nwnk commented Jun 15, 2020

There probably is a minimum kernel version that a particular version+driver in Mesa would support, but it's likely both ancient and not especially well documented. (And wildly misleading for like RHEL, where the drm gets regular updates despite the "kernel" version being pinned.)

Frankly, the behavior described sounds like a bug in Mesa 20. I'm reasonably sure we did not intentionally raise the minimum kernel version for Intel drivers between 19 and 20, and even if we had, RHEL7 would not have been a target we'd have dropped.

@scx
Author

scx commented Jun 26, 2020

Using a non-standard kernel on a daily basis is not an option for me, so I set up another instance of EL7 with a newer kernel, just for testing purposes. I was unable to boot the system with kernel-lt (4.4.227-1.el7.elrepo.x86_64), but fortunately it worked with kernel-ml (5.7.2-1.el7.elrepo.x86_64). However, the problem persisted. Then I tried a fresh install of CentOS 7.8. I was able to install and run Widelands 20 (Freedesktop 19.08) under GNOME, so I returned to my RHEL 7 to see what was wrong with it.

Thanks to GDB, I was able to locate the potential source of this issue:

Thread 1 "widelands" received signal SIGSEGV, Segmentation fault.
0x00007ffff43201dc in ?? () from /usr/lib/x86_64-linux-gnu/GL/default/lib/dri/iris_dri.so

Full log:

$  flatpak-builder --run "build" "org.widelands.Widelands.yaml" sh

(flatpak-builder:10688): flatpak-builder-WARNING **: 16:11:15.028: rofiles-fuse not available, doing without
sh-5.0$ gdb
GNU gdb (GDB) 8.3.1
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word".
(gdb) file widelands
Reading symbols from widelands...
Reading symbols from /usr/lib/debug//app/bin/widelands.debug...
(gdb) run
Starting program: /app/bin/widelands 
warning: File "/usr/lib/x86_64-linux-gnu/libthread_db-1.0.so" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
To enable execution of this file add
	add-auto-load-safe-path /usr/lib/x86_64-linux-gnu/libthread_db-1.0.so
line to your configuration file "/home/scx/.gdbinit".
To completely disable this security protection add
	set auto-load safe-path /
line to your configuration file "/home/scx/.gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
	info "(gdb)Auto-loading safe path"
warning: Unable to find libthread_db matching inferior's thread library, thread debugging will not be available.
This is Widelands Version Build 21~r24630 (Release)
Set home directory: /home/scx/.var/app/org.widelands.Widelands/data/widelands
Set configuration file: /home/scx/.var/app/org.widelands.Widelands/config/widelands/config
Adding directory: /app/share/widelands
selected language: (system language)
No corresponding locale found
 - Set LANGUAGE, LANG and LC_ALL to 'en'
 - Set system locale to 'en_US.utf8' to make 'en' accessible to libintl
Byte order: little-endian
[New LWP 10]
[New LWP 11]
[New LWP 12]
[New LWP 13]
Graphics: Try to set Videomode 800x600
Graphics: OpenGL: Version "4.6 (Compatibility Profile) Mesa 20.0.5"
Graphics: SDL_GL_RED_SIZE is 8
Graphics: SDL_GL_GREEN_SIZE is 8
Graphics: SDL_GL_BLUE_SIZE is 8
Graphics: SDL_GL_ALPHA_SIZE is 0
Graphics: SDL_GL_BUFFER_SIZE is 24
Graphics: SDL_GL_DOUBLEBUFFER is 1
Graphics: SDL_GL_DEPTH_SIZE is 24
Graphics: SDL_GL_STENCIL_SIZE is 8
Graphics: SDL_GL_ACCUM_RED_SIZE is 0
Graphics: SDL_GL_ACCUM_GREEN_SIZE is 0
Graphics: SDL_GL_ACCUM_BLUE_SIZE is 0
Graphics: SDL_GL_ACCUM_ALPHA_SIZE is 0
Graphics: SDL_GL_STEREO is 0
Graphics: SDL_GL_MULTISAMPLEBUFFERS is 0
Graphics: SDL_GL_MULTISAMPLESAMPLES is 0
Graphics: SDL_GL_ACCELERATED_VISUAL is 1
Graphics: SDL_GL_CONTEXT_MAJOR_VERSION is 2
Graphics: SDL_GL_CONTEXT_MINOR_VERSION is 1
Graphics: SDL_GL_CONTEXT_FLAGS is 0
Graphics: SDL_GL_CONTEXT_PROFILE_MASK is 0
Graphics: SDL_GL_SHARE_WITH_CURRENT_CONTEXT is 0
Graphics: SDL_GL_FRAMEBUFFER_SRGB_CAPABLE is 0
Graphics: OpenGL: Double buffering enabled
Graphics: OpenGL: Max texture size: 16384
Graphics: OpenGL: ShadingLanguage: "4.60"

Thread 1 "widelands" received signal SIGSEGV, Segmentation fault.
0x00007ffff43201dc in ?? () from /usr/lib/x86_64-linux-gnu/GL/default/lib/dri/iris_dri.so
(gdb) quit
A debugging session is active.

	Inferior 1 [process 6] will be killed.

Quit anyway? (y or n) y
sh-5.0$ exit
exit

It is probably related to this bug in Mesa (Iris driver):
https://bugs.launchpad.net/ubuntu/+source/mesa/+bug/1877879
https://bugs.freedesktop.org/show_bug.cgi?id=111376
https://gitlab.freedesktop.org/mesa/mesa/-/issues/1358

Actually, we have plenty of issues related to the "iris" driver.
https://gitlab.freedesktop.org/mesa/mesa/-/issues?label_name%5B%5D=iris

This one occurs only on Intel Graphics with the "intel" driver (not "modesetting").

When I tried to run Widelands with MESA_LOADER_DRIVER_OVERRIDE=i965, it just worked.

Log:

$ flatpak-builder --run "build" "org.widelands.Widelands.yaml" sh

(flatpak-builder:11175): flatpak-builder-WARNING **: 16:15:40.605: rofiles-fuse not available, doing without
sh-5.0$ MESA_LOADER_DRIVER_OVERRIDE=i965 widelands 
This is Widelands Version Build 21~r24630 (Release)
Set home directory: /home/scx/.var/app/org.widelands.Widelands/data/widelands
Set configuration file: /home/scx/.var/app/org.widelands.Widelands/config/widelands/config
Adding directory: /app/share/widelands
selected language: (system language)
No corresponding locale found
 - Set LANGUAGE, LANG and LC_ALL to 'en'
 - Set system locale to 'en_US.utf8' to make 'en' accessible to libintl
Byte order: little-endian
Graphics: Try to set Videomode 800x600
Graphics: OpenGL: Version "3.0 Mesa 20.0.5"
Graphics: SDL_GL_RED_SIZE is 8
Graphics: SDL_GL_GREEN_SIZE is 8
Graphics: SDL_GL_BLUE_SIZE is 8
Graphics: SDL_GL_ALPHA_SIZE is 0
Graphics: SDL_GL_BUFFER_SIZE is 24
Graphics: SDL_GL_DOUBLEBUFFER is 1
Graphics: SDL_GL_DEPTH_SIZE is 24
Graphics: SDL_GL_STENCIL_SIZE is 8
Graphics: SDL_GL_ACCUM_RED_SIZE is 0
Graphics: SDL_GL_ACCUM_GREEN_SIZE is 0
Graphics: SDL_GL_ACCUM_BLUE_SIZE is 0
Graphics: SDL_GL_ACCUM_ALPHA_SIZE is 0
Graphics: SDL_GL_STEREO is 0
Graphics: SDL_GL_MULTISAMPLEBUFFERS is 0
Graphics: SDL_GL_MULTISAMPLESAMPLES is 0
Graphics: SDL_GL_ACCELERATED_VISUAL is 1
Graphics: SDL_GL_CONTEXT_MAJOR_VERSION is 2
Graphics: SDL_GL_CONTEXT_MINOR_VERSION is 1
Graphics: SDL_GL_CONTEXT_FLAGS is 0
Graphics: SDL_GL_CONTEXT_PROFILE_MASK is 0
Graphics: SDL_GL_SHARE_WITH_CURRENT_CONTEXT is 0
Graphics: SDL_GL_FRAMEBUFFER_SRGB_CAPABLE is 0
Graphics: OpenGL: Double buffering enabled
Graphics: OpenGL: Max texture size: 16384
Graphics: OpenGL: ShadingLanguage: "1.30"
**** GRAPHICS REPORT ****
 VIDEO DRIVER GLVND x11
 pixel fmt 370546692
 size 800 600
**** END GRAPHICS REPORT ****
Style Manager: Reading style templates took 55ms
**** SOUND REPORT ****
SDL version: 2.0.12
SDL_mixer version: 2.0.4
**** END SOUND REPORT ****
Songset: Loaded song "music/intro.ogg"
SoundHandler: Closing 1 time, 22050 Hz, format 32784, 2 channels
SoundHandler: SDL_AUDIODRIVER pulseaudio
sh-5.0$ exit
exit

GDB output:

$ flatpak-builder --run "build" "org.widelands.Widelands.yaml" sh

(flatpak-builder:11175): flatpak-builder-WARNING **: 16:15:40.605: rofiles-fuse not available, doing without
sh-5.0$ export MESA_LOADER_DRIVER_OVERRIDE=i965
sh-5.0$ gdb
GNU gdb (GDB) 8.3.1
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-unknown-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word".
(gdb) file widelands
Reading symbols from widelands...
Reading symbols from /usr/lib/debug//app/bin/widelands.debug...
(gdb) run
Starting program: /app/bin/widelands 
warning: File "/usr/lib/x86_64-linux-gnu/libthread_db-1.0.so" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
To enable execution of this file add
	add-auto-load-safe-path /usr/lib/x86_64-linux-gnu/libthread_db-1.0.so
line to your configuration file "/home/scx/.gdbinit".
To completely disable this security protection add
	set auto-load safe-path /
line to your configuration file "/home/scx/.gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
	info "(gdb)Auto-loading safe path"
warning: Unable to find libthread_db matching inferior's thread library, thread debugging will not be available.
This is Widelands Version Build 21~r24630 (Release)
Set home directory: /home/scx/.var/app/org.widelands.Widelands/data/widelands
Set configuration file: /home/scx/.var/app/org.widelands.Widelands/config/widelands/config
Adding directory: /app/share/widelands
selected language: (system language)
No corresponding locale found
 - Set LANGUAGE, LANG and LC_ALL to 'en'
 - Set system locale to 'en_US.utf8' to make 'en' accessible to libintl
Byte order: little-endian
[New LWP 27]
[New LWP 28]
[New LWP 29]
[New LWP 30]
Graphics: Try to set Videomode 800x600
Graphics: OpenGL: Version "3.0 Mesa 20.0.5"
Graphics: SDL_GL_RED_SIZE is 8
Graphics: SDL_GL_GREEN_SIZE is 8
Graphics: SDL_GL_BLUE_SIZE is 8
Graphics: SDL_GL_ALPHA_SIZE is 0
Graphics: SDL_GL_BUFFER_SIZE is 24
Graphics: SDL_GL_DOUBLEBUFFER is 1
Graphics: SDL_GL_DEPTH_SIZE is 24
Graphics: SDL_GL_STENCIL_SIZE is 8
Graphics: SDL_GL_ACCUM_RED_SIZE is 0
Graphics: SDL_GL_ACCUM_GREEN_SIZE is 0
Graphics: SDL_GL_ACCUM_BLUE_SIZE is 0
Graphics: SDL_GL_ACCUM_ALPHA_SIZE is 0
Graphics: SDL_GL_STEREO is 0
Graphics: SDL_GL_MULTISAMPLEBUFFERS is 0
Graphics: SDL_GL_MULTISAMPLESAMPLES is 0
Graphics: SDL_GL_ACCELERATED_VISUAL is 1
Graphics: SDL_GL_CONTEXT_MAJOR_VERSION is 2
Graphics: SDL_GL_CONTEXT_MINOR_VERSION is 1
Graphics: SDL_GL_CONTEXT_FLAGS is 0
Graphics: SDL_GL_CONTEXT_PROFILE_MASK is 0
Graphics: SDL_GL_SHARE_WITH_CURRENT_CONTEXT is 0
Graphics: SDL_GL_FRAMEBUFFER_SRGB_CAPABLE is 0
Graphics: OpenGL: Double buffering enabled
Graphics: OpenGL: Max texture size: 16384
Graphics: OpenGL: ShadingLanguage: "1.30"
**** GRAPHICS REPORT ****
 VIDEO DRIVER GLVND x11
 pixel fmt 370546692
 size 800 600
**** END GRAPHICS REPORT ****
Style Manager: Reading style templates took 9ms
**** SOUND REPORT ****
SDL version: 2.0.12
SDL_mixer version: 2.0.4
**** END SOUND REPORT ****
[New LWP 31]
[New LWP 32]
Songset: Loaded song "music/intro.ogg"
SoundHandler: Closing 1 time, 22050 Hz, format 32784, 2 channels
SoundHandler: SDL_AUDIODRIVER pulseaudio
[LWP 32 exited]
[LWP 30 exited]
[LWP 29 exited]
[LWP 28 exited]
[LWP 27 exited]
[LWP 31 exited]
[Inferior 1 (process 23) exited normally]
(gdb) quit
sh-5.0$ exit
exit

It should be noted that I use the "intel" driver instead of "modesetting". This is mainly because I have to use the "TearFree" option, which isn't available for the general driver.
https://bugs.freedesktop.org/show_bug.cgi?id=98876#c2
https://gitlab.freedesktop.org/xorg/xserver/-/issues/244

While the current version of marco (MATE's window manager) has support for XPresent, EPEL7 provides only MATE 1.16, which was released in 2016.
mate-desktop/marco#350
https://src.fedoraproject.org/rpms/marco/blob/5cbec0b3fe31090ffe6391c4df1276e534f55de5/f/marco.spec#_5

What is worse, the XPresent extension could work only with DRI3, and I have to use DRI2 because of other issues.

So, even with experimental packages of MATE 1.18, it couldn't work, and they still have some issues with MATE applets.
https://copr.fedorainfracloud.org/coprs/raveit65/Mate-GTK3/
https://copr.fedorainfracloud.org/coprs/raveit65/mate-desktop-extra/

In the past, I tried to use Compton, but besides not fully solving the tearing problem, it had very poor integration with MATE (Alt-Tab window previews, Workspace Switcher, shadows under window borders, etc.).

Anyway, this is my X.Org configuration when it comes to the Intel GPU:

Section "Device"
        Identifier  "Intel"
        #Driver     "modesetting"
        Driver      "intel"
        BusID       "PCI:0:2:0"
        ###
        Option      "TearFree"              "True"
        #Option     "DRI"                   "3"
        Option      "DRI"                   "2"
        Option      "AccelMethod"           "glamor"
EndSection

To summarize, this issue is probably related to a Mesa bug, and it should be fixed in Freedesktop 19.08.

@Erick555
Contributor

Yes, this isn't a flatpak issue, so you can close this.

@scx
Author

scx commented Jun 27, 2020

@scx
Author

scx commented Jun 28, 2020

@scx
Author

scx commented Jun 28, 2020

@Erick555

> Yes, this isn't a flatpak issue, so you can close this.

Since it occurs only with the intel DDX driver (not modesetting), it still may be somehow related to the X.Org version.
Anyway, this error may be caused by several problems.

@Erick555
Contributor

Flatpak uses the X server from the host, so a possible bug in Xorg isn't a flatpak issue to resolve.

@scx
Author

scx commented Jun 28, 2020

@Erick555

> Flatpak uses the X server from the host, so a possible bug in Xorg isn't a flatpak issue to resolve.

The DRI driver from the host works fine with the DDX driver and X.Org Server, so it's hard to blame the distribution for this issue.
If there is a potential incompatibility between the Freedesktop and host components, this is a flatpak problem IMHO. Actually, that is what this report is about.
I haven't had time to check whether the problem persists on the latest Fedora, so I can't tell whether or not it is a flatpak issue.
Feel free to borrow my X.Org configuration and verify it on your own if you are so impatient.

@Erick555
Contributor

Erick555 commented Jun 28, 2020

This may be a bug in mesa, ddx, sdl, xorg or whatever, but none of these scenarios suggests it's a bug in flatpak itself, which is my point. This is the wrong place for this issue. You even said it yourself, so I don't understand why you argue when I just agreed with you:

> To summarize, this issue is probably related to a Mesa bug, and it should be fixed in Freedesktop 19.08.

Mesa on the host is irrelevant, and a kernel dev ruled out a kernel issue.

@scx
Author

scx commented Jun 28, 2020

> This may be a bug in mesa, ddx, sdl, xorg or whatever, but none of these scenarios suggests it's a bug in flatpak itself, which is my point. This is the wrong place for this issue. You even said it yourself, so I don't understand why you argue when I just agreed with you:
>
> > To summarize, this issue is probably related to a Mesa bug, and it should be fixed in Freedesktop 19.08.
>
> Mesa on the host is irrelevant, and a kernel dev ruled out a kernel issue.

At this point, we don't know exactly where the source of this error is located. It could be caused by many components, including the kernel, Mesa, X.Org, glibc, libstdc++, LLVM, etc.
Just a few hours ago, I was pretty sure this bug was located in Mesa, because I was able to reproduce it on both EL 7 and Fedora 30. But now I really don't know, because someone was able to run test apps without any problem on Arch, using exactly the same X.Org configuration as mine.
What if, e.g., some Mesa drivers are somehow incompatible with an older X.Org Server, at least in some situations (e.g. when using the intel DDX driver and DRI2)? Linux distros support only the packages they provide. They won't care about the Mesa provided by Freedesktop.
For sure, there is something wrong in Mesa 19/20 and Freedesktop 19.08. But maybe, just maybe, it is only the tip of the iceberg.
If a Mesa driver from Freedesktop can run properly on one system and crash on another, then maybe flatpak should provide mechanisms to prevent such situations. Maybe we need an additional layer between Mesa and X.Org. At this point, I don't really know.

@scx
Author

scx commented Jun 28, 2020

> kernel dev ruled out kernel issue.

This is quite interesting, because Lionel Landwerlin (member of Mesa, DRM and X.Org) stated that Iris requires kernel 4.16 or higher.
https://gitlab.freedesktop.org/mesa/mesa/-/issues/2845#note_478079

In the FAQ we have:

> Please note that Iris requires a modern kernel, such as 4.19+.

https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/283#qa

Someone else suggested that it may require even a 5.x+ kernel.
https://linuxreviews.org/Intel_Iris
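
The thresholds quoted above (4.16 and 4.19) can be compared against a kernel release string. The following is only a minimal sketch: `kernel_version_tuple` and `meets_minimum` are hypothetical helper names, and, as noted earlier in this thread, the version string is misleading on distributions that backport DRM changes to an older kernel, so a failed check is a hint rather than proof.

```python
import re


def kernel_version_tuple(release):
    """Extract (major, minor) from a release string such as
    '4.4.227-1.el7.elrepo.x86_64'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def meets_minimum(release, minimum=(4, 16)):
    """True if the release is at least `minimum`. This ignores backports,
    so a distribution kernel may be rejected although it would work."""
    return kernel_version_tuple(release) >= minimum


# The two ELRepo kernels tested earlier in this thread.
for release in ("4.4.227-1.el7.elrepo.x86_64", "5.7.2-1.el7.elrepo.x86_64"):
    print(release, meets_minimum(release))
```

On a live system the release string could be taken from `platform.release()`. Checking the individual i915 parameters instead of the version number would avoid the backport problem, at the cost of talking to the kernel driver directly.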

@nanonyme
Contributor

@scx that seems highly problematic. Sounds like Iris should somehow be opt-in if it's breaking compatibility expectations to this extent.

@nanonyme
Contributor

nanonyme commented Jul 8, 2020

Okay, so the current plan is to build Mesa with -Dprefer-iris=false, at least for 19.08, and this is going to be released soon. If the intent is to support CentOS 7, then the i965->iris migration is really a no-go until there are host-side fixes. This is a clear regression during the runtime support cycle and is clearly not okay. We're discussing how to handle 20.08, which is about to be released.

@scx
Author

scx commented Jul 10, 2020

Issues

There are several issues related to the iris driver.

I. For most distributions, it doesn't work when DRI2 and the intel DDX driver are in use. At this moment, even Fedora Rawhide and openSUSE Tumbleweed are affected. Basically, everything that uses xf86-video-intel (xorg-x11-drv-intel) older than 2 months (or without the DRI2ATTACH_FORMAT patch) is affected.

Thread 1 "widelands" received signal SIGSEGV, Segmentation fault.
0x00007ffff3fb71dc in ?? () from /usr/lib/x86_64-linux-gnu/GL/default/lib/dri/iris_dri.so
[ 1799.091114] widelands[8374]: segfault at 24 ip 00007fd9d4aa01dc sp 00007ffedc2ec850 error 6 in libgallium_dri.so[7fd9d409b000+cfd000]
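
The kernel log line above can be decoded into an offset inside the library, which is the form that symbol lookup tools expect when debug symbols are available. This is a minimal sketch, assuming the standard x86 page-fault error-code bits (bit 0: page present, bit 1: write access, bit 2: user mode); `parse_segfault` is a hypothetical helper, not part of any tool mentioned in this thread.

```python
import re

DMESG_LINE = ("[ 1799.091114] widelands[8374]: segfault at 24 "
              "ip 00007fd9d4aa01dc sp 00007ffedc2ec850 error 6 in "
              "libgallium_dri.so[7fd9d409b000+cfd000]")

PATTERN = re.compile(
    r"segfault at ([0-9a-f]+) ip ([0-9a-f]+) sp ([0-9a-f]+) "
    r"error ([0-9a-f]+) in ([^\[\s]+)\[([0-9a-f]+)\+([0-9a-f]+)\]")


def parse_segfault(line):
    """Decode a kernel 'segfault at ...' message into a small dict."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("not a segfault line")
    addr, ip, _sp, err, lib, base, _size = match.groups()
    error = int(err, 16)
    return {
        "library": lib,
        "fault_address": int(addr, 16),
        # Offset of the faulting instruction from the mapping base.
        "offset": int(ip, 16) - int(base, 16),
        "page_present": bool(error & 1),
        "write": bool(error & 2),
        "user_mode": bool(error & 4),
    }


info = parse_segfault(DMESG_LINE)
print(f"{info['library']} +{info['offset']:#x}, "
      f"fault address {info['fault_address']:#x}, write={info['write']}")
```

For this line the offset is 0xa051dc into libgallium_dri.so, and the fault is a user-mode write to the unmapped address 0x24, which would be consistent with a write through a near-null pointer inside the driver.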

I've already reported this problem in EL and Fedora:

I also created a COPR repo with patched xf86-video-intel (xorg-x11-drv-intel) for EL 6-8 and Fedora 30-33:

However, I doubt that Ubuntu LTS or RHEL will get patches anytime soon. It should be noted that this configuration is quite popular on EL7, due to a lack of other options to prevent screen tearing outside GNOME3.

Maybe we should consider performing additional checks to verify if xf86-video-intel (xorg-x11-drv-intel) has been fixed on the host. It could be done in Mesa, Freedesktop or in Flatpak itself.

See:

II. The iris driver crashes on Linux < 4.16 (at least without backports related to iris in i915_drm). Some older LTS distributions are affected. However, Red Hat provided backports to the EL7 kernel.

Thread 1 "widelands" received signal SIGABRT, Aborted.
0x00007ffff752d605 in raise () from /usr/lib/x86_64-linux-gnu/libc.so.6

Some time ago Mesa added code that should handle such a situation and fall back to software rendering in case of failure. It was introduced in Mesa 20.1.0 and then backported to Mesa 20.0.6. However, this code is simply broken: instead of falling back to software rendering, it just crashes. Even the latest release (Mesa 20.1.3) is affected. However, it may be fixed upstream in a future release.

Anyway, at this moment, both Mesa 20.0.5 from flathub and 20.1.2 from freedesktop-sdk are affected - the iris driver just crashes.

It should be noted that Mesa tries to fall back to swrast in case of failure. That is not an optimal choice for Gen8-Gen11 hardware; in such a scenario, it should try the i965 driver first. But even on Gen12, it could be done better. As you may know, swrast (the legacy Mesa software rasterizer) is much slower than llvmpipe (which is multi-threaded and uses LLVM for x86 JIT code generation) or even swr (an x86-optimized software renderer). In my opinion, swrast should be a last resort. However, there is a small catch here. For Tiger Lake, we can assume that the CPU is capable of fast software rendering using llvmpipe. However, Intel plans to release a Gen12-based dGPU (Intel Xe), which can also work on architectures other than x86, and llvmpipe is optimized only for x86 and PPC.

See:

III. Some time ago Mesa released distro packaging guidelines for Intel/Iris. They literally recommend "upgrading to xserver 1.20.7 and libepoxy 1.5.4 before shipping iris", and they explicitly recommend against using iris without them. It may work, but nothing is guaranteed. We should ask ourselves whether we are able to test every single app on Flathub against an older X.Org release.

There was a discussion about this in Ubuntu. There are experimental Mesa 20 packages for Ubuntu 18.04 LTS in bionic-proposed, and they explicitly prefer i965 over iris in debian/rules.

  # Build intel drivers on archs where libdrm-intel is installed
  ifneq (,$(filter $(DEB_HOST_ARCH),amd64 i386 kfreebsd-amd64 kfreebsd-i386 x32))
	DRI_DRIVERS += i915, i965,
	GALLIUM_DRIVERS += iris,
	confflags_GALLIUM += -Dprefer-iris=false
  endif

Commit date: 2020-06-17

I think we should do the same if we really care about older distributions.

See:

Solutions

There are a few possible solutions for these problems.

I. Do not ship the iris driver at all. This is exactly what EL7, EL8 and Ubuntu 18.04 LTS do now. However, Mesa states that "the plan is for i965 to stop receiving new hardware support at Gen11, so for Tigerlake/Gen12, iris will be the only driver". As you can see, this is not a long-run option for us.

See:

II. Build Mesa with the -Dprefer-iris=false option, to prefer i965 over iris when possible (on Intel Gen8-11 hardware).

This is probably what we are going to do in the short term. It will require some fixes for Tiger Lake, e.g. upgrading Mesa to a future version that falls back properly on older kernels.

However, we should allow users to choose iris if they really want to. It is pretty simple with classic packages: you can use the MESA_LOADER_DRIVER_OVERRIDE environment variable or a drirc file. However, these things are basically ignored by flatpak. First of all, flatpak packages don't have access to the /etc/drirc file from the host. Moreover, most games are completely sandboxed when it comes to the filesystem, so they can't read ~/.drirc either.
Anyway, I think it requires some work on the flatpak side.

See:

III. Perform additional checks to select the best possible driver.

It could look like this:

  • Gen >= 12: try iris, then software rendering
    • If DRI2 and the intel DDX driver are in use, check whether xf86-video-intel (xorg-x11-drv-intel) has been patched. If yes, try iris. If not, fall back to software rendering.
    • On new kernels, try to use iris. If an older kernel is in use (Linux < 4.16, without backports related to iris in i915_drm), fall back to software rendering. We should look for at least I915_PARAM_HAS_EXEC_NO_RELOC, I915_PARAM_HAS_EXEC_HANDLE_LUT, I915_PARAM_HAS_EXEC_BATCH_FIRST, I915_PARAM_HAS_EXEC_FENCE_ARRAY and I915_PARAM_HAS_CONTEXT_ISOLATION.
  • Gen >= 8 and Gen < 12: try iris, then i965, then software rendering
    • If DRI2 and the intel DDX driver are in use, check whether xf86-video-intel (xorg-x11-drv-intel) has been patched. If yes, try iris. If not, fall back to i965. If that fails as well, fall back to software rendering.
    • If an older kernel is in use (Linux < 4.19, without backports related to iris in i915_drm), fall back to i965 (according to Mesa: "There is a known kernel memory leak in some 4.18 kernels, which Iris can provoke. The fix has already been backported, but a few distros didn't pick it up before moving on to 4.19+."). If that fails as well, try to run iris if the kernel has support for I915_PARAM_HAS_EXEC_NO_RELOC, I915_PARAM_HAS_EXEC_HANDLE_LUT, I915_PARAM_HAS_EXEC_BATCH_FIRST, I915_PARAM_HAS_EXEC_FENCE_ARRAY and I915_PARAM_HAS_CONTEXT_ISOLATION. In case of a failure, fall back to software rendering.
    • If X.Org >= 1.20.7 is in use, try iris. If not, fall back to i965. If that fails as well, try to run iris. In case of a failure, fall back to software rendering.
  • Gen >= 4 and Gen < 8: i965, then software rendering
    • Try to use i965. In case of a failure, fall back to software rendering.
  • Gen >= 2 and Gen < 4: i915, then software rendering
    • Try to use i915. In case of a failure, fall back to software rendering.
  • Gen < 2
    • It isn't handled by the i915 kernel module at all. I doubt that anyone uses it anymore, especially with flatpak. It can be handled by i810, but it doesn't support modern OpenGL anyway.

As for software rendering: try llvmpipe on modern x86 or PPC, swr on older x86, and swrast as a last resort.
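
The selection order above can be sketched as a function that returns the drivers to try, in order. This is only an illustration under stated assumptions: the `Host` fields and function names are hypothetical, the kernel checks (version or i915 parameters) are collapsed into a single `kernel_ok` flag, and the software renderer order is simplified to a fixed list per architecture.

```python
from dataclasses import dataclass


@dataclass
class Host:
    gen: int                      # Intel graphics generation
    intel_ddx_dri2: bool = False  # DRI2 with the intel DDX driver in use
    ddx_patched: bool = False     # xf86-video-intel carries the DRI2 fix
    kernel_ok: bool = True        # kernel (or its backports) suitable for iris
    xorg: tuple = (1, 20, 7)      # X.Org Server version
    arch: str = "x86"             # "x86", "ppc", or anything else


def software_renderers(arch):
    """Simplified order: llvmpipe first, swrast as the last resort."""
    if arch == "x86":
        return ["llvmpipe", "swr", "swrast"]
    if arch == "ppc":
        return ["llvmpipe", "swrast"]
    return ["swrast"]


def candidate_drivers(host):
    """Return driver names in the order they should be tried."""
    software = software_renderers(host.arch)
    if host.gen < 2:
        return []  # not handled by the i915 kernel module at all
    if host.gen < 4:
        return ["i915"] + software
    if host.gen < 8:
        return ["i965"] + software
    ddx_ok = not host.intel_ddx_dri2 or host.ddx_patched
    if host.gen >= 12:
        # No i965 support here: iris or software rendering only.
        iris_ok = ddx_ok and host.kernel_ok
        return (["iris"] if iris_ok else []) + software
    # Gen8 to Gen11: both iris and i965 are possible.
    if not ddx_ok:
        return ["i965"] + software
    if host.kernel_ok and host.xorg >= (1, 20, 7):
        return ["iris", "i965"] + software
    return ["i965", "iris"] + software


# The configuration reported in this thread, assuming HD Graphics 630 is Gen9.
print(candidate_drivers(Host(gen=9, intel_ddx_dri2=True, ddx_patched=False)))
```

Returning an ordered list keeps the policy separate from the mechanism: whatever actually loads the driver can walk the list and stop at the first one that works.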

Anyway, it could be done in Mesa, Freedesktop or in Flatpak itself.

Please keep in mind that iris is a relatively new driver, and in just a month we found several different issues. It's quite possible that there are more of them that we don't know about yet. Maybe we should perform some tests to check whether iris can work on a specific host? This should actually be done in flatpak. When to run it? There are several options:

  • During flatpak update. IMHO, it is a bad idea. The user can run the system with a different kernel or log in to a different desktop environment that uses Wayland.
  • During system startup. Again, a bad idea, because the user can switch from e.g. GNOME on X.Org to GNOME on Wayland.
  • After logging into a desktop environment. Not bad, but different apps may use different runtimes, with different Mesa versions.
  • Before the first run of the app. This is how it works for the font cache. IMHO, it is the best moment to test it.

What could such a test look like? It could be a pretty simple windowless program (there are several ways to create a hidden dummy window, e.g. glfwWindowHint(GLFW_VISIBLE, GL_FALSE) in GLFW) that creates a graphics context and makes it current (e.g. glfwCreateWindow followed by glfwMakeContextCurrent in GLFW, SDL_GL_CreateContext in SDL2) and tries to perform one operation, e.g. glClear(GL_COLOR_BUFFER_BIT). However, there is a small catch here. If something goes wrong, it could hang the program for about one minute. Maybe the problem could be detected faster when using threads.
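
One way to keep such a test from taking the caller down with it is to run it as a child process with a timeout, rather than in a thread: a segfault inside the driver kills the whole process, so a thread would not isolate it. The sketch below is an assumption-laden illustration, not an existing flatpak mechanism. `probe` is a hypothetical helper, and the commands passed to it here are stand-ins for a real GL probe binary, which is not shown.

```python
import subprocess
import sys


def probe(command, timeout_s=5.0):
    """Run a probe command in a child process and classify the outcome."""
    try:
        result = subprocess.run(command, timeout=timeout_s,
                                stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
    except subprocess.TimeoutExpired:
        return "hang"     # the child was killed after the timeout
    except OSError:
        return "missing"  # the probe binary could not be started
    if result.returncode < 0:
        return "crash"    # POSIX: terminated by a signal such as SIGSEGV
    return "ok" if result.returncode == 0 else "error"


# Stand-ins for a real probe: a clean exit, a segfault and a hang.
clean = [sys.executable, "-c", "pass"]
segv = [sys.executable, "-c",
        "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"]
stuck = [sys.executable, "-c", "import time; time.sleep(30)"]

print(probe(clean, 10.0), probe(segv, 10.0), probe(stuck, 0.5))
```

With a timeout well below one minute, a hang is reported as a failure instead of blocking the application start, and the result could be cached per runtime in the same way the font cache is.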

I know, it is a dirty hack, but if you really want to support a whole range of distributions, from 5 years old to bleeding-edge distros, you can either test them all or provide some mechanisms that will do it for you. The current approach just doesn't work well.

See:

@Erick555
Contributor

Erick555 commented Jul 10, 2020

> However, we should allow users to choose iris if they really want to. It is pretty simple with classic packages: you can use the MESA_LOADER_DRIVER_OVERRIDE environment variable or a drirc file. However, these things are basically ignored by flatpak

Is MESA_LOADER_DRIVER_OVERRIDE really ignored by flatpak? I just checked, and flatpak run --env=MESA_LOADER_DRIVER_OVERRIDE=xxx app.id works. I think this is a pretty simple solution for users to change drivers.
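
A launcher or wrapper script could build that command line as follows. This is a minimal sketch: `flatpak_run_command` is a hypothetical helper, and the application ID is taken from the Widelands manifest name used earlier in this thread.

```python
import shlex


def flatpak_run_command(app_id, driver):
    """Build the argv for running a flatpak app with a Mesa driver override."""
    return ["flatpak", "run",
            f"--env=MESA_LOADER_DRIVER_OVERRIDE={driver}", app_id]


cmd = flatpak_run_command("org.widelands.Widelands", "i965")
print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.run` without going through a shell, so no quoting of the application ID or driver name is needed.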

@nanonyme
Contributor

It should work if not suppressed by the app. I just checked, and there are no runtime-level suppressions.

@nanonyme
Contributor

nanonyme commented Jul 10, 2020

There is now a newer GL extension published for 19.08 which will prefer i965 on hardware supported by both i965 and iris. If issues still persist, please report the exact GPU used so we can determine whether it's supposed to be supported by i965 at all.

@nanonyme
Contributor

20.08 will also possibly be released with -Dprefer-iris=false, depending on whether distro support for iris has improved before it's out.

@nanonyme
Contributor

nanonyme commented Aug 25, 2020

This has apparently resulted in even more breakage now. Apparently, if you have Iris on the host but i915 in the guest, there may be crashes. So this change in driver preference, made to get things working on RHEL, may be resulting in more and more crashes on newer distros. Also, if the two should not be combined, a simple configuration switch most definitely isn't enough as a solution for this.
