webgl in browsers unusable
Closed, ResolvedPublic

Description

Many people tested on various machines and we reached the conclusion that E does something funky with firefox.

On the same machine under X, firefox has webgl disabled in E and enabled in KDE.

I personally tested with E_X and Weston, with firefox running under Xwayland; it was not working in E while it was working in Weston.

Many other people confirmed it. EFL compiled with EGL might be a cause but at some point this was working with E.

ApB created this task.Jun 23 2017, 11:16 AM
ApB reopened this task as Open.Aug 16 2017, 2:22 AM

Not fixed. I still see this on stable on wayland.

zmike added a comment.Aug 16 2017, 7:40 AM

It's probably never going to be fully resolved in the E21 branch since this is not a stable thing to backport.

zmike renamed this task from Enlightenment messes up with firefox. to webgl in browsers unusable.Aug 17 2017, 8:54 AM
zmike reassigned this task from zmike to raster.
zmike triaged this task as High priority.

After a day in testing, this is a symptom of a much larger issue: no browser successfully detects webgl availability (ie. gl usability) when launched directly from enlightenment; this occurs in both wayland mode and x11 mode. Launching the browser from a terminal works fine, and browsers successfully detect gl availability, but when launched directly from enlightenment it fails every time.

Some relevant links for firefox:
gl detection code
ticket discussing gl detection code
gl driver blacklisting code

zmike raised the priority of this task from High to Showstopper Issues.Aug 23 2017, 9:17 AM

Now EFL always reports that my driver is blacklisted under wayland.

llde added a subscriber: llde.Sep 10 2017, 2:06 PM

It's a bit random for me. Sometimes it reports that the driver is blacklisted for both WebGL and WebGL2. Sometimes (most of the time) it reports a blacklisted driver only for WebGL2. Sometimes, as right now, it reports both as working.
I'm on the latest firefox nightly, with Stylo activated.

When I launch it from the command line (terminology) it always works properly.

This is on X and on Wayland
ArchLinux x64
AMD Radeon R5
Mesa 17.3.0 devel
efl 1.20.3 enlightenment 0.21.9

I was thinking that I may try to create a small test case if there isn't already one.

http://webglreport.com/?v=2 -> works for me (chromium). chromium launched from e.

broken in firefox. gl works in rage. in games. in many places. i really don't know how it's us. we don't set any env vars that cause this (the env is the same from shell and e except shell has more env vars added... so from e is a subset). clutching at straws - some fd inherited from the e process? but how would inheriting an fd cause this? it seems to be something firefox is explicitly doing. i'm only looking at x11 here. not wayland at this point.

i DO know that glxtest.cpp fails. at least setting MOZ_AVOID_OPENGL_ALTOGETHER=1 and stracing including following child processes finds a:

20437 write(11, "The MOZ_AVOID_OPENGL_ALTOGETHER "..., 63 <unfinished ...>

so it is getting the env var and getting this far. so either this or some other logic taking the output of this fails. i do notice that a broken strace only seems to open /dev/nvid* devices in one burst, and a working one does it multiple times.

without basically compiling my own firefox and inserting probe/test code line by line to find where it fails... i don't know. but firefox is making the decision to fail. gl does WORK for everything i run under e when it doesn't go through some elaborate probing (games from steam, rage with efl, other gl things like glxgears etc.).

ApB added a comment.Sep 11 2017, 1:11 AM

The question is why it only happens in E and not on any other compositor.

llde added a comment.Sep 11 2017, 2:00 AM

@raster the fact is that with firefox it is also completely random.
Booted the PC, started firefox from the icon in E: WebGL2 fails to create a context. Exited and restarted firefox: same result. Restarted firefox to install the nightly update: WebGL2 works.

WebGL1 always works in this test, but sometimes it fails too.

llde added a comment.EditedSep 11 2017, 1:33 PM

Can this or something similar be relevant?
https://bugzilla.mozilla.org/show_bug.cgi?id=718629

I see:

WebGL creation failed: * Refused to create native OpenGL context because of blacklist entry: * Exhausted GL driver options.

for webgl 1 and similar (blacklist) for webgl 2.

i don't see how e has anything to do with driver being used in x11. this is a matter between firefox and the xserver. it doesn't go through e. the only thoughts are some inherited environment (env vars, fd's, ... other flags like security - but we don't set any of these)... thus why i looked at environment first. i see your regular libGL being loaded by firefox like everyone else:

20644 open("/usr/lib/libGL.so.1", O_RDONLY|O_CLOEXEC) = 11
20644 open("/usr/lib/libGLdispatch.so.0", O_RDONLY|O_CLOEXEC) = 11
20644 open("/usr/lib/libGLX.so.0", O_RDONLY|O_CLOEXEC) = 11
20644 open("/usr/lib/libEGL.so.1", O_RDONLY|O_CLOEXEC) = 11

20646 open("/usr/lib/libGLX_nvidia.so.0", O_RDONLY|O_CLOEXEC <unfinished ...>
20646 <... open resumed> )              = 12
20646 open("/usr/lib/libGL.so.1.0.0", O_RDONLY <unfinished ...>
20646 <... open resumed> )              = 12
20646 open("/usr/lib/libGLdispatch.so.0.0.0", O_RDONLY <unfinished ...>
20646 <... open resumed> )              = 12
20646 open("/usr/lib/libGLX.so.0.0.0", O_RDONLY <unfinished ...>
20646 <... open resumed> )              = 12
20646 open("/usr/lib/libGLX_nvidia.so.384.59", O_RDONLY <unfinished ...>
20646 <... open resumed> )              = 12
20646 open("/usr/lib/libGLX_nvidia.so.384.59", O_RDONLY <unfinished ...>
20646 <... open resumed> )              = 12
20646 open("/usr/lib/libGLX_nvidia.so.384.59", O_RDONLY <unfinished ...>
20646 <... open resumed> )              = 13
20646 open("/usr/lib/libGL.so.1.0.0", O_RDONLY) = 12
20646 open("/usr/lib/libGLdispatch.so.0.0.0", O_RDONLY <unfinished ...>
20646 <... open resumed> )              = 12
20646 open("/usr/lib/libGLX.so.0.0.0", O_RDONLY <unfinished ...>
20646 <... open resumed> )              = 12
20646 open("/usr/lib/libGLX_nvidia.so.384.59", O_RDONLY) = 12
20646 open("/usr/lib/libEGL.so.1.0.0", O_RDONLY <unfinished ...>
20646 <... open resumed> )              = 12
20646 open("/usr/lib/libEGL.so.1.0.0", O_RDONLY <unfinished ...>
20646 <... open resumed> )              = 12

20720 open("/usr/lib/libGL.so.1", O_RDONLY|O_CLOEXEC) = 7
20720 open("/usr/lib/libGLdispatch.so.0", O_RDONLY|O_CLOEXEC) = 7
20720 open("/usr/lib/libGLX.so.0", O_RDONLY|O_CLOEXEC <unfinished ...>
20720 <... open resumed> )              = 7
20720 open("/usr/lib/libEGL.so.1", O_RDONLY|O_CLOEXEC <unfinished ...>
20720 <... open resumed> )              = 7

etc. ... so 3 processes (parent and children) seem to load the right gl libs... why it then blacklists... i don't know. if i use about:config and set vendor and renderer string to what glxinfo reports, webgl1 works. webgl2 still blacklisted, but webgl1 actually does work. finds all the right extensions etc. ... so it is seemingly finding the right driver as above stracing implies... but then is choosing to deny use of it for another reason.

FYI. some more stracing -> this is what is reported from the glx probe process to the parent in terms of vendor and renderer:

7138  write(12, "VENDOR\nNVIDIA Corporation\nRENDERER\nGeForce GTX 970/PCIe/SSE2\nVERSION\n4.5.0 NVIDIA 384.59\nTFP\nTRUE\n", 98) = 98

so correct vendor, and renderer string and texture from pixmap flag is true... it's all correct coming from the child. BUT ... i only see a write(). no read(). the parent process just never reads this output. i see the child process exit:

7138  +++ exited with 0 +++

so exit code 0... parent gets the sigchld:

7136  --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=7138, si_uid=1000, si_status=0, si_utime=0, si_stime=5} ---

all correct... later on i DO see something odd:

7249  wait4(-1,  <unfinished ...>
7249  <... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 7138

that's not the parent process where sigchld was reported... but it SEEMS to be waiting for children and finds the gl checking process exiting... 7136 actually creates 7249:

7136  clone( <unfinished ...>
7136  <... clone resumed> child_stack=0x7fb0926fdfb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7fb0926fe9d0, tls=0x7fb0926fe700, child_tidptr=0x7fb0926fe9d0) = 7249

this is quite ... unusual. but anyway. what I do see here is simply firefox never reading the output of its gl probe process. in fact... this smells very very very suspicious. http://searchfox.org/mozilla-central/source/widget/GfxInfoX11.cpp indicates in GfxInfo::GetData() that it will read the pipe THEN wait for the child exit code for that pid... specifically. the code never gets this far. it doesn't do the read ... let alone the waitpid() for the specific child - there is a generic "wait for anything" as above (the -1 wait4()) which finds the child exit code... so it never even gets this far ...
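For reference, the "read the pipe, THEN wait for that specific child" sequence described above can be sketched roughly as follows. This is only an illustration, not the firefox code: run_probe() and the report text are made up, and the child here just writes a fixed string instead of probing GL.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not from firefox): fork a probe child that writes a
 * report into a pipe, read the report in the parent, then reap that one
 * child by pid. Returns the number of bytes read, or -1 on error. */
static ssize_t run_probe(char *buf, size_t buflen, int *exit_status)
{
    int pfd[2];
    if (buflen == 0 || pipe(pfd) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(pfd[0]);
        close(pfd[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: stand-in for the real GL probe. */
        static const char report[] =
            "VENDOR\nExample Vendor\nRENDERER\nExample Renderer\n";
        close(pfd[0]);
        ssize_t w = write(pfd[1], report, sizeof(report) - 1);
        _exit(w == (ssize_t)(sizeof(report) - 1) ? 0 : 1);
    }

    /* Parent: close the write end so read() sees EOF when the child exits. */
    close(pfd[1]);
    size_t total = 0;
    ssize_t n;
    while (total < buflen - 1 &&
           (n = read(pfd[0], buf + total, buflen - 1 - total)) > 0)
        total += (size_t)n;
    buf[total] = '\0';
    close(pfd[0]);

    /* Wait for this specific pid, not for "any child" (-1). */
    int status = 0;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    *exit_status = WIFEXITED(status) ? WEXITSTATUS(status) : -1;
    return (ssize_t)total;
}
```

Calling run_probe(buf, sizeof buf, &status) from the parent should fill buf with the child's report and set status to the child's exit code. Note that nothing in this sketch cares what number the pipe fd happens to have.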

palemoon (firefox fork) has this issue too and reports

GPU Accelerated Windows 0/1 Basic Blocked for your graphics driver version. Try updating your graphics driver to version <Anything with EXT_texture_from_pixmap support> or newer.
WebGL Renderer Blocked for your graphics card because of unresolved driver issues.

while firefox reports:

WebGL Renderer WebGL creation failed: * Refused to create native OpenGL context because of blacklist entry: * Exhausted GL driver options.

an strace of both shows, like raster's, that the detection inside the fork works and is written, but it is never read by the parent.

In T5606#95094, @zmike wrote:

After a day in testing, this is a symptom of a much larger issue: no browser successfully detects webgl availability (ie. gl usability) when launched directly from enlightenment;

chromium does for me. it disables webgl2, but maybe my pc is too old anyways. but it detects ati gpu and applies lots of workarounds, so it's a firefox/palemoon issue for me.

now i never used the pipe syscall, i think it comes down to this in firefox.

/toolkit/xre/glxtest.cpp::260

extern int glxtest_pipe;

bool fire_glxtest_process()
{
  int pfd[2];
  if (pipe(pfd) == -1) {

(....)

   mozilla::widget::glxtest_pipe = pfd[0];
}

where pfd[0] is 0, so glxtest_pipe is 0 as well, and that value is then checked in /widget/GfxInfoX11.cpp:53 right before the read() call.

void
GfxInfo::GetData()
{
     // to understand this function, see bug 639842. We retrieve the OpenGL driver information in a
     // separate process to protect against bad drivers.

     // if glxtest_pipe == 0, that means that we already read the information
     if (!glxtest_pipe)
         return;

and yep, the attached test code from the man page does exactly that for me.

started via terminology it prints and writes to /tmp/pipe

pipefd[0] 3 pipefd[1] 4

starting via enlightenment file manager writes to /tmp/pipe

pipefd[0] 0 pipefd[1] 3
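The attached test program is not reproduced here, but the same effect can be shown without a launcher by closing fd 0 before calling pipe(). A minimal sketch, assuming Linux behaviour (pipe() hands out the lowest free descriptors, read end first); the helper name is made up for illustration:

```c
#include <unistd.h>

/* Hypothetical demo helper: temporarily close fd 0, create a pipe, and
 * return the fd number the read end received. fd 0 is restored afterwards
 * if it was open to begin with. Returns -1 if pipe() fails. */
static int pipe_read_fd_without_stdin(void)
{
    int saved = dup(STDIN_FILENO);   /* -1 if fd 0 was already closed */
    close(STDIN_FILENO);             /* simulate a launcher leaving fd 0 closed */

    int pfd[2];
    int read_fd = -1;
    if (pipe(pfd) == 0) {
        read_fd = pfd[0];            /* lowest free fd, which is now 0 */
        close(pfd[0]);
        close(pfd[1]);
    }

    if (saved != -1) {               /* put the original stdin back */
        dup2(saved, STDIN_FILENO);
        close(saved);
    }
    return read_fd;
}
```

With stdin closed the read end lands on fd 0, which matches the "pipefd[0] 0" line seen when starting from the enlightenment file manager.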

so, at least for firefox it looks like it's the mozilla guys' fault. now i only need to find my credentials for their bugzilla...

aha! pipe fd is 0! ... probably because stdin has been closed by e before execution because... there is no stdin! and kernel recycles fd 0 now for the pipe. indeed bug in browsers. 0 is a valid fd. -1 is what it should be (or anything < 0) to indicate invalid. indeed a classic mistake that is rarely found. palemoon is a firefox based browser so it'd suffer the same issue. chromium maybe suffering the same one too. indeed we close stdin for security reasons... otherwise processes we fork and exec inherit e's stdin... which could be the actual console (tty) itself which would give processes the ability to mess with vt's and things they absolutely shouldn't. so we are probably the only DE or one of the very few that actually does this right and properly nukes stdin... thus why they haven't cottoned onto this behavior yet and hit this issue and... haven't realized the fat security hole lurking elsewhere...
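To make the sentinel problem concrete, here is a small sketch comparing the two checks. The function names and PIPE_NONE are invented for illustration; this is not the actual firefox patch, just the "0 is valid, use a negative value for invalid" idea.

```c
#include <stdbool.h>

/* Assumed sentinel for "no pipe / already read"; any negative value works
 * because valid file descriptors are never negative. */
#define PIPE_NONE (-1)

/* Buggy variant: treats 0 as "nothing to read", so a pipe whose read end
 * happens to be fd 0 is silently ignored. */
static bool buggy_pipe_is_pending(int pipe_fd)
{
    return pipe_fd != 0;
}

/* Fixed variant: only negative values mean "nothing to read". */
static bool fixed_pipe_is_pending(int pipe_fd)
{
    return pipe_fd >= 0;
}
```

With the fixed check the pipe variable would start out as PIPE_NONE and be set back to PIPE_NONE after the read, so fd 0 is handled like any other descriptor.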

we are probably the only DE or one of the very few that actually does this right and properly nukes stdin

:)

in firefox nightly/aurora this is now fixed and will be in firefox 57. i don't know if it gets backported, but given it's a small fix i assume so.

also fixed in palemoon unstable.

how big a project is, can be identified by its bureaucracy.

Typically only recent regressions, frequent crashes, or security bugs are backported to more stable branches.
AIUI this is a long-standing issue. It's also very late in the 56 beta cycle for this kind of fix.

https://wiki.mozilla.org/Release_Management/Uplift_rules#Beta_Uplift_.28approval-mozilla-beta.29
https://wiki.mozilla.org/RapidRelease/Calendar

not sure they are doing the "but it's trivial and provably correct" weigh-up (an fd of 0 is a perfectly valid fd... always has been and just moving to -1 solves it trivially). potential impact - close to 0. potential positives: obvious already.

In T5606#99139, @raster wrote:

not sure they are doing the "but it's trivial and provably correct" weigh-up (an fd of 0 is a perfectly valid fd... always has been and just moving to -1 solves it trivially). potential impact - close to 0. potential positives: obvious already.

yes, exactly. but by the tone of that answer i'll rather mail my distribution maintainer to include that patch than to try to pointlessly discuss this with mozilla...

ApB added a comment.Sep 21 2017, 10:41 AM

how big a project is, can be identified by its bureaucracy.

Typically only recent regressions, frequent crashes, or security bugs are backported to more stable branches.
AIUI this is a long-standing issue. It's also very late in the 56 beta cycle for this kind of fix.

https://wiki.mozilla.org/Release_Management/Uplift_rules#Beta_Uplift_.28approval-mozilla-beta.29
https://wiki.mozilla.org/RapidRelease/Calendar

If its a security threat -like raster describes above- write an exploit that fucks things up > release it > see them patch it 10 versions back while you sip on your fav alcoholic beverage and laughing your ass off :D

We could work around this kind of crap in E by just opening /dev/null (without cloexec!) and using dup2(that, 0) after we close stdin I guess.
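A minimal sketch of that /dev/null idea, assuming it runs in whatever process is about to exec the child. The helper name is made up; only open(), dup2() and close() are real calls.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: make fd 0 refer to /dev/null so that children get a
 * harmless stdin instead of either the tty or nothing at all.
 * Returns 0 on success, -1 on failure. */
static int redirect_stdin_to_devnull(void)
{
    /* Deliberately no O_CLOEXEC: if fd 0 was already closed, open() returns
     * 0 directly and this descriptor must survive exec(). */
    int fd = open("/dev/null", O_RDONLY);
    if (fd == -1)
        return -1;

    if (fd != STDIN_FILENO) {
        /* dup2() closes the old fd 0 (if open) and replaces it atomically;
         * the duplicate never has FD_CLOEXEC set. */
        if (dup2(fd, STDIN_FILENO) == -1) {
            close(fd);
            return -1;
        }
        close(fd);
    }
    return 0;
}
```

After this, fd 0 is occupied, so a later pipe() in the child cannot be handed fd 0, and reads from stdin just return end-of-file.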

It's not a security bug in firefox - the security bug would potentially be in any desktop that leaks fds to child processes.

I just checked under gnome-shell and it looks like child processes inherit fd0 as /dev/null. I was... actually suggesting that as a joke. :(

Should we do this too?

(also E is leaking pipes again, I'm going to go shout profanities at my dogs for a while then figure out who else gets some of that.)

@raster I just spent a looong time trying to figure out where E actually closes stdin, as i wanted to add the /dev/null dup2() in the same place.

It appears to be a completely unintentional side effect of initializing fds to 0 in a bunch of EO objects - ecore_ipc_server_connect() is causing fd 0 (which is still /dev/pts/whatever at the time) to be set cloexec.

The level of irony here appears to be absolutely astronomical, since we're treating 0 as "invalid fd" and complaining about DEs that don't close stdin (which we do entirely by accident)...
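For anyone wanting to check this on their own build, the close-on-exec flag can be set and inspected with fcntl(). A small sketch with made-up helper names; it only shows the flag handling and is not the EFL code:

```c
#include <fcntl.h>
#include <unistd.h>

/* Set FD_CLOEXEC on fd, keeping any other descriptor flags.
 * Returns 0 on success, -1 on failure. */
static int set_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFD, flags | FD_CLOEXEC);
}

/* Returns 1 if fd is close-on-exec, 0 if not, -1 if fd is not open. */
static int is_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags == -1)
        return -1;
    return (flags & FD_CLOEXEC) != 0;
}

/* Returns 1 if an exec()ed child would inherit our fd 0, 0 if fd 0 would be
 * closed on exec, -1 if fd 0 is not open at all. */
static int stdin_survives_exec(void)
{
    int c = is_cloexec(STDIN_FILENO);
    return c == -1 ? -1 : !c;
}
```

If something calls set_cloexec(0) by accident, as described above, stdin_survives_exec() returns 0 and every exec()ed child starts with fd 0 closed.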

@ManMower - i remember you explicitly talking about this stdin/out stuff face to face as a security hole especially for wayland as it gives wl clients we spawn access to the tty... i thought you fixed it... but i distinctly remember this conversation several months ago...

Ok, so scumbag @cedric just fixed the accidental stdin close for us, so it's probably possible to write a keylogger for our wayland compositor now. I have a local E patch to set close on exec for stdin, but due to this persistent network or server outage I can't push it.

Turns out we were leaking stdout and stderr anyway though, which are probably still bad for child processes to mess with, but I'm sure someone's depending on them being shared for logging something?

This along with the 2 new pipes I haven't taken time to track down yet.

Next week I'd like to make ecore_exe_run() close stdin and replace it with /dev/null so the fd exists, unless someone stops me of course...
I can't really think of a good way to do this in E since we can't do anything between the fork() and exec().

Some serious thought needs to be given as to whether stderr and stdout should be handed to children - I think some of the same ioctls that work on stdin can be used on them as well, and a child with those fds open could truncate log files?

ok... now even worse. i can't find anything behind ecore_ipc that closes fd 0 ... well at least starting with the ecore_ipc examples as a faster to hunt simpler example...

a ha, bad local firewall rule was messing up ssh connections originating inside my network.

Cedric fixed the unintended fd 0 cloexec with efl commit 17507bab43

oh ok. you were looking and it seems cedric patched it before i started looking... like a few hours before only! :)

yes, I remember that conversation, and I thought I fixed this too, but I think what actually happened is I fixed it in weston, spoke to you about fixing it in E, then discovered E didn't even have the problem.

And then today I discovered why E didn't have the problem. And then things got messy. Sorry for the snark, I'm just annoyed I didn't catch this back when I was first looking into it.

For years we've had correct behaviour because of a bug. Now we have it on purpose. I guess that's progress. :)

I'll write up the /dev/null thing for ecore_exe_run monday, and then I think we can close this ticket without waiting for the firefox fix to land in a release.

oooh so that's what happened. i really thought you just fixed it and that's why stdin is closed... indeed that it's closed by a bug is bad... but it was good it was closed... :)

oh i can do the ecore_exe flags stuff too... i was just posing that as a "isn't this better?" idea...

ah, if you've got a better idea I'll stand back and watch :)

I won't have time to touch anything until monday at the earliest.

Just let me know if you want me to do the dup2 /dev/null thing like gnome does, or if you've got something up your sleeve.

zmike closed this task as Resolved.Sep 29 2017, 9:58 AM