Fullscreen DXVK-window rendering problem
Open, High, Public

Description

CPU: AMD 5600X, GPU: GTX 1070, OS: debian/sid, with EFL & Enlightenment compiled from git.

When I start Elite:Dangerous (game running under wine with DXVK) I get a completely black fullscreen window that stays black. The game is running, I can hear the intro sounds. When I toggle the fullscreen mode the window gets updated once, and it now only shows that one rendered frame. The game keeps running, as I can hear the menu sounds when I move my mouse cursor around.
This does not just happen with E:D, but several other steam (proton) games show the same problem.

While trying to video capture the problem I found a workaround:
When OBS is doing a window capture (XComposite) of the game window, and I then toggle the pinned state (using windows+up keybind) the game window starts rendering normally.
If it helps here is a short (24 second) video capture. Top left is the screen capture (XSHM) and it matches what I see on the monitor, bottom right is the game window capture (XComposite).
Sometimes during game play the game screen freezes again, while OBS still sees the game window rendering/updating. Toggling the pinned state again often fixes it.

I know it's caused by updating my nvidia drivers. 440.100 worked fine, newer versions so far all have this problem.
Unfortunately running the old 440.100 nvidia drivers is no longer an option for me, as it does not compile with the new kernels I need for my 5600X (specifically the motherboard audio and network drivers).

I thought I'd still report it here as I'm only seeing this with E, and not when using XFCE, Gnome or KDE (and I'd prefer to keep using E, especially with this nice new flat theme).
I'd appreciate some help with this, as I don't know how to debug / analyse this further.
But if you close this ticket because of it being a nvidia driver issue, I understand.

pwerken created this task. Apr 14 2021, 12:03 PM
ProhtMeyhet triaged this task as Pending on user input priority. Edited Apr 18 2021, 8:49 AM
ProhtMeyhet added a subscriber: ProhtMeyhet.

These are possible workarounds, and the answers may help determine whether there is a bug.

  1. Save your work just in case. It might crash (it probably won't), but be on the safe side because Enlightenment will have to restart.
  2. Make a backup copy of your settings folder, just in case. Location: ~/.e
  3. In Settings -> "Composite" click "Advanced" (lower right).
  4. In the tab "Rendering" you can try 3 options - on and off - and see if any of them help:
     • "Don't composite fullscreen windows"
     • "Don't fade backlight"
     • "Tear free updates (VSynced)"

Please then report back what your current settings are and whether anything there helped. The "Assume swapping method" option is only there in case you have performance issues, usually just with newer hardware. As such it is unlikely that the problem lies there.

Thank you for helping.

The Composite settings dialog shows:

  • Don't composite fullscreen windows : on
  • Don't fade backlight : off
  • Tear free updates (VSynced) : on

I tried changing each of these individually, testing whether the game renders normally, but no luck. I've also tried the 4 remaining combinations, with no change.
For now I've left all these options on.

While repeatedly toggling the game window between "maximize" and "unmaximize" using the e-keybinding I noticed something...

Most of the time (like 9 out of 10 times) my shelves are visible in both states. My shelves are set "below windows".

  • with the game window in "maximize" state, it doesn't overlap with my shelves (= as expected, and the clock and cpufreq update as normal). But the rendered frame looks as if it was rendered for a full screen sized window (i.e. it's not squished or anything, it has the proper aspect ratio).
  • with the game window in "unmaximized" state: the frame shows the game scene slightly squished, as if it were rendered for the maximized window size (= the full screen minus the shelves at the bottom). The shelves are also still visible but they don't update (clock and cpufreq frozen).

It is as if the frame is for the previous window size instead of the current window size.

On some rare occasions, after a long time of toggling, the game window starts rendering as normal. If I toggle it back to "unmaximized" it goes back to the broken state.

It feels a bit like there's some kind of race condition going on.
Is there any logging I can enable that might help figure this out?

TwoD added a subscriber: TwoD. Tue, Sep 7, 1:44 AM

I have had similar, if not the same, issues for months with Elite Dangerous, Doom Eternal, and pretty much any other game doing rendering, and Chrome/Chromium on all my machines running Enlightenment. Most often it happens when resizing and playing a YouTube video. The whole window rendering locks up and I have to resize again to make a few more frames render and keep doing it until it continues normally. If it gets really bad I have to reload with Ctrl+Alt+End and hope it starts rendering again, but that often ends up in a reload loop and won't stop fading back to black until I switch to a vconsole and back. That in turn causes a full lockup, and if E itself doesn't force a restart I have to send it a SIGKILL (thank you enlightenment_start for keeping my apps alive!).

Juggled all rendering options I've found but nothing appears to help long term. The other day I tried forcing it to do double-buffering and that seemed to have fixed it for Chrome. It was rendering stable for days until I came in today and it's freezing on resize again. Verified double buffering was still enabled, flipped to triple buffering and back, restarted Chrome, still the same. Reloaded E with Ctrl+Alt+End, same.

My uptime is currently 20 days on this machine but it doesn't appear to be something which depends on that as it also does it right after a reboot. Could it be multi-monitor related?

All my machines have the latest nVidia driver and run Arch, usually with E and EFL from git, but it happened in the latest releases too. All machines are running under X due to nVidia, not tried Wayland mode. It has happened with both a GTX 950 and a GTX 970, going to try an RTX 3060 and see if that series has the same issue.

Just for giggles I tried out KDE Plasma for comparison and it does appear to be a lot more stable, at least when playing games, but I did get similar issues for a short while in Chrome with YouTube so it can happen there too - just a lot harder to reproduce. I also have other issues with KDE and closed windows becoming semi-transparent ghosts so maybe that indicates it's driver/X related?

I recently also noticed that if I share my screen in Teams I can't click anything on that shared screen. Keyboard input still works but the overlay with a red square around that screen seems to block input. Killing Teams kept that overlay and still prevented input even after reloading with Ctrl+Alt+End.

I tried looking through logs to see if there are any errors but not sure I have a high enough loglevel or where to look. My ~/.xsession-errors is empty but not sure I have the correct loglevels set.

ProhtMeyhet raised the priority of this task from Pending on user input to High. Wed, Sep 8, 6:55 AM
ProhtMeyhet added a subscriber: raster.

so it seems this happens in e and kde. it seems to hang about more stubbornly in e but the core issue is the same. that to me says "time to talk to nvidia". you could try the nouveau drivers - but YMMV there. another option is to switch to another gpu (eg amd or intel etc.). fyi the double/triple buffering doesn't change buffering - it changes what evas ASSUMES the buffering mode is for calculating partial updates. the best is to leave it on auto. there is no setting in e/evas to specifically use single, double, triple etc. buffering. it will thus use whatever the driver layer thinks is right (normally triple buffering).
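To make the "assumed buffering mode" point concrete, here is a minimal sketch in C of how an assumed buffer count can feed into a partial update calculation. This is not the actual evas code: the types, the function names and the bounding-box simplification are all assumptions made for illustration. The idea is that if N buffers rotate, the buffer about to be drawn into last held the frame from N swaps ago, so it has to be brought up to date with the damage of the current frame plus the N-1 frames before it.

```c
#include <assert.h>

/* Hypothetical types and names for illustration only; not the evas API. */
typedef struct { int x, y, w, h; } Rect;   /* w <= 0 or h <= 0 means empty */

#define MAX_HISTORY 4

typedef struct {
    /* damage[0] is the current frame, damage[1] the previous one, ... */
    Rect damage[MAX_HISTORY];
} DamageHistory;

/* Bounding box of two rectangles; an empty rectangle contributes nothing. */
Rect rect_union(Rect a, Rect b)
{
    if (a.w <= 0 || a.h <= 0) return b;
    if (b.w <= 0 || b.h <= 0) return a;
    int x1 = a.x < b.x ? a.x : b.x;
    int y1 = a.y < b.y ? a.y : b.y;
    int x2 = (a.x + a.w) > (b.x + b.w) ? (a.x + a.w) : (b.x + b.w);
    int y2 = (a.y + a.h) > (b.y + b.h) ? (a.y + a.h) : (b.y + b.h);
    Rect r = { x1, y1, x2 - x1, y2 - y1 };
    return r;
}

/* Record the damage of a new frame, shifting older entries back by one. */
void history_push(DamageHistory *h, Rect current)
{
    for (int i = MAX_HISTORY - 1; i > 0; i--)
        h->damage[i] = h->damage[i - 1];
    h->damage[0] = current;
}

/* Region to redraw when the swap chain is ASSUMED to rotate
 * `assumed_buffers` buffers: the union of the damage of the current frame
 * and of the (assumed_buffers - 1) frames before it. */
Rect region_to_redraw(const DamageHistory *h, int assumed_buffers)
{
    assert(assumed_buffers >= 1 && assumed_buffers <= MAX_HISTORY);
    Rect r = { 0, 0, 0, 0 };
    for (int i = 0; i < assumed_buffers; i++)
        r = rect_union(r, h->damage[i]);
    return r;
}
```

With this model, assuming double buffering while the driver actually rotates three buffers would leave out one frame's worth of damage, which shows up as stale regions rather than as a completely frozen window. That is consistent with the advice to leave the setting on auto.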

you could also try toggling the "don't composite fullscreen windows". this has e turn off compositing when a window is detected that fills the entire screen - i have noticed drivers seem to be iffy on this and sometimes just stop updating that window. YMMV based on driver.

your teams issue - i don't know but ctrl+alt+end would have killed off the e process and restarted it so you are looking at some other issue outside of e... as everything e had created would have been destroyed when e restarted. :|

so .. nouveau - and if that doesn't help.. another gpu.

TwoD added a comment. Edited Wed, Sep 8, 2:40 PM

Thanks for the clarification.

I have tried toggling compositing for fullscreen windows, but if it wasn't clear this was also happening on non-fullscreen windows for me.

Found a reliable way to trigger the problem using Chrome; just hit F11 on any tab and it happens almost every time in E but never anymore in KDE. Hitting F on a YouTube video is less likely to cause it to happen but hammering it for a while usually does. Perhaps a bit more likely if there are multiple windows open but not sure. Did these tests with a single screen first on the 970 GTX and then on the RTX 3060.
The 970 I could also try with the nouveau driver and did not experience any issues with Chrome other than fullscreening being a bit more "abrupt" than in Plasma.
There's no support for the 3060 in nouveau so couldn't test with that.

I was unable to find any games which launched properly on nouveau except one, which crashed soon after so couldn't really compare.

On both cards on KDE I could launch Elite Dangerous and Doom Eternal and see the intro screen/video. When launching them on E the screen was always black until I toggled showing the desktop with Ctrl+Alt+D a few times, then the games were updating smoothly.
The games were running as "fullscreen" borderless windows, and attempting to toggle actual fullscreening on them with Ctrl+Alt+F just made them flash and they stayed black. When I did get them to render properly after Ctrl+Alt+D toggling Alt+Tab was misbehaving instead. The window switcher comes up but selecting anything other than the fullscreen game would just make it jump back up. I've got focus follows mouse on so that maybe interferes, but it's not set to raise windows until clicking. Toggling back to the desktop with Ctrl+Alt+D made Alt+Tab work again, until I toggled desktop again to get the game to update, so it was one or the other.

I was able to screenshot the frozen frame when Chrome does it.

It appears to clone or re-draw the window in the top left corner but in the original size and then stops drawing. Hitting F11 (or F when on YT) again a few times makes it fullscreen properly.

If Chrome is already the size of the screen (minus the bottom bar, so I guess technically it's maximized, not fullscreen) and I press F to fullscreen a video it also freezes, pressing F again updates the window and tries to go back to the non-fullscreen player but the whole window freezes again. Usually I have to make Chrome not fullscreen and maybe resize it to get it to redraw anything again.

While testing/provoking/bashing E with as much stuff as I could throw at it, it crashed twice.
The first time was on nvidia when pressing Ctrl+Alt+F2 and then went back with Ctrl+Alt+F1, E went black. I could still hear sound playing from at least two videos but nothing was rendering (except maybe for the cursor, forgot). Went back to the console with Ctrl+Alt+F2 and sent a SIGKILL to E. Then it redrew with Guru Meditation error #00793109.00000011, the open windows behind it (no decorations), and produced this dump. Clicking to restart worked and I was unable to reproduce by switching to the console again.

The second crash was on nouveau and happened when rapidly fullscreening two windows playing YouTube videos (hammering F and moving the cursor between them). That time it booted me out to the console so I guess it took enlightenment_start with it.

Tried to reproduce but couldn't.

PS: Forgot to mention I have the compositor settings back to defaults, no manual X11 config file and both EFL and E were updated from master just before testing this. For E I disabled the Wayland flag as instructed in the PKGBUILD file and left the EFL one untouched.

raster added a comment. Wed, Sep 8, 4:28 PM

so the crashdumps do seem to be an e problem. will get back to that later.

your "not rendering" i suspect is an nvidia driver bug of some sort - it just so happens that e just triggers it very easily. e doesn't handle nvidia differently to anything else. the fact it works on nouveau already indicates that everything else being the same - driver is one big differentiating factor to help cause it. i would bet it'd work fine on intel, amd too (certainly does for me - fullscreen youtube in chrome or anything in chrome, steam games - lots of them etc. i know work fine on all my intel/amd gpu systems). i have no more nvidia cards as i moved away from them due to their binary drivers, impossibility to debug them when things don't work and their lack of support for wayland (gbm). you have some problem that happens inside a binary proprietary blob that cannot be inspected or figured out. the e/efl side is all oss and if you look you'll see it handles it all the same way as when it does work.

the way compositing works is simply a x11 pixmap is transformed to an eglimage (the eglimage is created then an egl call maps the pixmap to the image). then that eglimage is used as a source when rendering a texture that represents the window inside the canvas. this works on vc4/6 (rpi2/3/4), lima (mali-4xx), panfrost (mali t6/7/8xx, mali g-xx (bifrost)), and i also see it working on adreno (qualcomm) oss mesa drivers too. basically every driver stack i have access to (a fair few - everything except nvidia and nouveau on nvidia or tegra gpu's) seems to work. as i have none of these (and am not going back there any time soon because of the above) i can't reproduce what you see, or debug it other than make suggestions to you on what to try. the fact that vt switching seems to make a difference or even cause problems hints even more that there is some nvidia driver bug as this is all done inside the xserver process - the ONLY piece of code in e that knows about a vt switch is the code that modifies the backlight - it suspends modifying backlight if the vt is switched away. otherwise e works without caring about vt switches. it should not have to know as that's the point of the windowing system and opengl abstraction. (the only reason the backlight cares is because this bypasses x to play with the backlight devices directly).

now back to your notification crash... unfortunately that is:

if (cb)
  cb(data, id);
eldbus_connection_unref(conn);

the middle line above. so either it's segfaulting inside that callback (cb) and the backtrace is not capturing that (frame #5) and thus ... i'm kind of lost or cb itself is so crazy the call to the function itself is the problem and the value of cb is the issue. this is where asan would help. there is an efl-git-asan and enlightenment-git-asan aur package set that would compile in asan support and give a proper trace on a crash to stdout/stderr that would be most helpful! :)

raster added a comment. Wed, Sep 8, 4:46 PM

just been trying to reproduce your crash in e .. i can't. holding down f when playing a yt video. ctrl+f on any window. holding down f11 in chrome too... toggles back and forth, no issue even with asan enabled in the build which should catch bugs that might otherwise be skipped. :( if you can build with asan - that'd help as i can't see the issue :(

TwoD added a comment. Wed, Sep 8, 11:55 PM

Great info, thanks!
I'll try building with asan as soon as I can and see if I can reproduce. I suspect triggering that same notification (I never saw what it was) will be tricky, but maybe the crash when toggling two YT videos fullscreen or when vt switching is easier to reproduce.
IIRC the only thing I had autostart which could trigger notifications, other than Chrome, is Discord. The app managing my keyboard lights (ckb_next) does autostart but I don't think I've seen it trigger notifications.
I SSH:ed home and just noticed Chrome is running a process/extension called plasma-browser-integration-host even when under E, could that be messing things up? I don't have that on my work machine which only runs E but don't know if I've gotten the same crash there (vt switching has done it though).
Is there anything else you'd like me to build differently from the Arch defaults, or perhaps some package to install?

I think I may have an AMD card from several generations back in a box somewhere, if I can find it I'll try getting it running. All my active machines currently have nVidia cards in them so that is a common factor.
I do have an old EEE901 which I could try to update, it should have an [integrated?] Intel GPU, I think, if that'd be interesting.
My main rig has an Intel one integrated in the i7-6700K but don't think I ever managed to get it running, perhaps if I don't plug in external card at all.

hmm that extension could be what is triggering things? maybe it's forcing built in browser notifications to use the notification dbus api? i don't think that is actually a bug - but it is triggering a bug in e (the same thing i am saying about the nvidia drivers :)). in this case there is an issue in e/efl somewhere and the job is to find it. such extensions - simple notification api usage by some client should not kill off e. even if they use the api poorly, e should handle it gracefully.

the amd and intel gpus will surely be a worse gaming experience for you in performance, but i bet the issue will not be there. this is more a double/triple confirmation for you that the nvidia binary blob driver is the common factor. given that all of the magic to transform/map a pixmap to an egl image and then bind it for rendering is entirely handled inside the driver stack (the code in e - that is actually in the evas engines just calls these functions that are in the gl driver to ask it to do this - and it does the same thing regardless of what driver is there), either we are doing something horribly wrong in how we call these that only falls apart on nvidia (but seemingly kde suffers from this too sometimes), or it's an nvidia driver bug. :(

as i said - i don't have any nvidia hardware anymore. i can't even begin to poke at it in a vain hope of guessing what this might be. the right way to debug, as i have done a few times over the years on other drivers, is... i build mesa with full gdb debug and i start tracing into the driver to find out if it has any logic at least in the library that says "if x and y and z then take some error path" and then from x, y, z i try figure out the situation that causes it. as this is a big fat binary blob... no chance. thus "ask nvidia - it's their blob - they are the only ones that can look into it and find out" (yes in theory i could trace every cpu instruction at the machine code level and attempt to figure out from the many mb of code what may be going on but that is just silly and i won't waste my time when the vendor has chosen to make life hard with a blob - i will invest my time if it means rebuilding something like mesa, using gdb and printfs etc. etc.)

so even if i could do the above.. i have no hardware to run it on, so there is a big blocker, but it's pretty much impossible for me to get you to do the above even if it was oss. :) the best option is to set up some way to easily reproduce this and clearly document it and file that with nvidia - after you have verified that nouveau doesn't suffer, intel gfx doesn't, amd gfx doesn't etc. thus they at least can't dismiss it as "application bug" right away. they have to either just ignore your report (which is pretty poor for a paying customer), or dismiss it invalidly (not actually look but just decide they don't have time for your silliness :)), or they have to look and they may find out the cause after some digging into their own stack of code. they may find eventually it is something we do - like how we create the eglimage and we do it in some order that happens to be technically invalid (i don't believe we do - but i'm happy to be corrected with some information on that) or something like this. the fact that on everything else from mali ddk binary blob drivers through to all the other oss mesa ones ... it works ... i'm taking the leap that we have somehow stumbled on a bug in the nvidia driver that happens to fall into a rare-yet-exists black hole where the driver just stops mapping the requested pixmap to anything that renders when we ask it to. so asking nvidia is actually the best course of action. you can tell them to talk to me if they actually get around to looking into this. i have a sneaking suspicion they may never bother (niche wm/compositor with smaller userbase where they probably have to install arch and all these aur's and set it all up which is actually a lot of work). they would bother with gnome, maybe kde too, but e - nah. i'd love to be pleasantly surprised though. :)

raster added a comment. Thu, Sep 9, 1:01 AM

i just installed the plasma browser integration extension - it seems to fail (tries to connect and failed - not surprising - it's not under plasma and possibly looking for some service). i'll see if this changes anything.

TwoD added a comment. Sat, Sep 11, 11:54 AM

I was able to reproduce the vt switching crash after installing the asan versions. Switched back and forth a few times and noticed all windows were black when coming back, and the last couple of times I only got the cursor, and then it froze too. Copied the .e-crashdump file and appended the datetime, then tried switching back and saw the Guru meditation error, and recovery worked. After that I read the logfile and saw it mentioning to use eina_btlog so did that, and then noticed the .e_logs folder and got the crashdumps from there too in case they are different. My .xsession-errors was empty.

Still experiencing the black screens when launching games, and very often the double rendering & freeze when fullscreening (not maximizing) Chrome, regardless of what's open in it. Curiously I've not been able to reproduce it with Firefox. Tried resetting chrome flags to defaults and even disabling some acceleration related ones but it still freezes irregularly. I caught it happening while running under apitrace and can find the frame it resizes the window to fullscreen but only draws the original sized contents in the top left corner. Then eventually comes a frame where it's drawn the fullscreen window correctly (and it was updating between frames), but then starts drawing the original small window on top of it, updating only it for the next frames, though I did not actually see it update anything.

Launching games does not output anything into e-log other than things like this

libinput error: client bug: timer event7 debounce: scheduled expiry is in the past (-10ms), your system is too slow
libinput error: client bug: timer event7 debounce short: scheduled expiry is in the past (-23ms), your system is too slow
CRI<584998>:e ../src/bin/e_hints.c:390 e_hints_client_stacking_set() Window list size less than window count.
## Copy & Paste the below (until EOF) into a terminal, then hit Enter

And the eina_btlog output is identical to the included file.

The top crashdump is the same place. It's deep inside the nvidia driver somewhere and no idea why. this certainly is not the notification bug thing you saw (I've been trying to reproduce it with asan on and find nothing here - e keeps running solidly day after day with notifications and i installed the same browser extension you did). but .. this is on amd - different gl library. as the gl library is just code that executes inside of the enlightenment process, a problem inside that library can cause just about any problem like memory corruption and crashes. it's not isolated in any way. asan doesn't find any issue because it isn't directly inside code executed by efl or e (as asan adds checking code into that to make sure it's doing things right). your option now is to use valgrind - but it's slow. it interprets every instruction a process runs and does this checking for everything. it may not be able to provide details other than "some code inside libnvidia.so did X to memory Y" but perhaps that memory might be some memory efl passed into gl. if efl (evas gl engine in this case) was passing bad memory into drivers and then they accessed it - i would be seeing your bug too i imagine. my guess here is this memory is internal to nvidia's driver. you're down to valgrind now though.

the second one is an internal consistency check... it's not happy with the pixmap size being different to the object size. this is some code zmike added to force crashes. it's not imho a good thing as it's some race condition that is in and of itself not critical. crashing at this point won't help find and fix the problem. it just points out you have one. the race is somewhere else. i'll disable that CRI as the worst that will happen is maybe for a single frame you will get some window image slightly stretched or something. that's certainly not worth forcing a crash on users. one thing - your line numbers do not match mine. ../src/bin/e_comp_object.c:2620 is where that log is and that is not where it is for me... are you not using git master?

i just converted a whole lot of CRI's to ERRs that are not useful in finding a bug where they are so this 2nd one should be gone. but it's time for valgrind i guess.
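As a rough illustration of the difference between a consistency check that forces a crash and one that only logs, here is a small C sketch. It is not the e_comp_object.c code: the `Size` type and both function names are hypothetical, and plain `assert`/`fprintf` stand in for whatever the real CRI and ERR logging macros do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical type for illustration; not an Enlightenment structure. */
typedef struct { int w, h; } Size;

/* Strict variant: a size mismatch aborts the process (unless NDEBUG is
 * defined). Useful while developing, painful for users who hit the race. */
void check_size_strict(Size pixmap, Size object)
{
    assert(pixmap.w == object.w && pixmap.h == object.h);
}

/* Lenient variant: a size mismatch is reported on stderr and returned to
 * the caller, which can draw the frame anyway (possibly stretched for one
 * frame) and pick up the correct size on a later update. */
bool check_size_logged(Size pixmap, Size object)
{
    if (pixmap.w == object.w && pixmap.h == object.h)
        return true;
    fprintf(stderr, "ERR: pixmap size %dx%d != object size %dx%d\n",
            pixmap.w, pixmap.h, object.w, object.h);
    return false;
}
```

For example, a pixmap still sized for the maximized window (say 1920x1040) checked against a fullscreen object (1920x1080) would kill the compositor in the strict variant, while the lenient variant logs one line and keeps running.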

TwoD added a comment. Sun, Sep 12, 2:13 AM

Ah, maybe I can run it through valgrind later, will see.

Thanks for the changes, I can see how those crashes are helpful during development but when just using it they get annoying if you encounter them often, as I seem to do.

Strange about the line numbers. All I changed was the WL flag in the E PKGBUILD so it should have been from master. Maybe the debug packages didn't install correctly or something. I've swapped back to the non-asan builds right now because they gave me issues running EFL apps like Terminology or the screen setup tool and I didn't have time to investigate why.