
EFL EO thread safety with __thread causes linker issues e.g. evas gl module loading
Open, TODO, Public

Description

As of EFL commit 4478654be185718ade7482df55c8643d3224fb37 GL compositing in E on X11 is broken. I get an error on startup:


Your display driver does not support OpenGL, GLSL shaders or no OpenGL engines were compiled or installed for Evas or Ecore-Evas. Falling back to software engine.

You will need an OpenGL 2.0 (or OpenGL ES 2.0) capable GPU to use OpenGL with compositing.


Reverting to EFL commit 09f19c3c73b73eeee4201bb462bad9f2b5276409 re-enables GL compositing in E. E on Wayland doesn't appear to be affected.

dkasak created this task. Sep 14 2016, 6:45 PM
raster added a subscriber: raster. Sep 15 2016, 1:31 AM

WTF? how can making the oid table a TLS affect the evas gl engine init? this only has to do with eo object access and not internals... ? hell, the gl engine doesn't even use any threads... so ... how?

fyi i'm using evas + e + gl compositing and at least on nvidia and intel drivers it's working - post that commit.

are you building efl with glx/opengl or egl/gles?

I think I should also raise my voice here.

I also have problems with this commit. I have objects which are just disappearing without being deleted. I could imagine that this is a similar case; just try verne before and after that commit. Before it everything is ok; after it, huge loads of errors in the output and disappearing elm.icons in the genlist/gengrid. Something broke really hard in between.

bu5hm4n added a subscriber: tasn. Sep 15 2016, 2:10 AM

I should add @tasn here

In T4611#70506, @raster wrote:

WTF? how can making the oid table a TLS affect the evas gl engine init? this only has to do with eo object access and not internals... ? hell, the gl engine doesn't even use any threads... so ... how?

Sorry ... this is way beyond my coding ability :)

fyi i'm using evas + e + gl compositing and at least on nvidia and intel drivers it's working - post that commit.

are you building efl with glx/opengl or egl/gles?

I usually build efl with these options:

--enable-drm --enable-gl-drm --enable-systemd --enable-wayland --enable-egl --with-opengl=es --enable-elput --enable-xinput22 --enable-systemd --enable-image-loader-webp --enable-harfbuzz --enable-multisense --enable-liblz4 --enable-scim

... so I can run under X11 or see how Wayland support is going. When I hit this issue, I rebuilt efl with these options:

--enable-drm --enable-systemd --enable-elput --enable-xinput22 --enable-image-loader-webp --enable-harfbuzz --enable-multisense --enable-liblz4 --enable-scim

... and got the same results, i.e. no GL compositing.

@bu5hm4n well there is a possibility since you are using eio that you were calling eo api's (via legacy) from threads... and thus... voila. it found your bugs. everything else i use/try works fine... that is kind of the point. this passes make check in efl. e runs. rage runs. terminology runs. i can find more examples.

verne doesn't even compile atm. :)

in fact i found a bug while doing this where the upper bit of the id part (below the super and ref bits) was doubled up as a "this is a class" bit so once your eoid's got big enough eo would fall over badly. this actually happened in the test suites because i was now using these bits for domain id's. (2 bits for them). so i spent a while beating my head on that and fixed it.

so if you are seeing lots of object creations fail or finding objects fails (because code is accessing/creating from different threads) then the behavior you see is EXACTLY what i'd expect to see.
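To make the overlapping-bit bug concrete, here is a minimal sketch of an object-id layout in which every field owns its own bits, with compile-time checks that the masks never overlap. The layout (a class flag, a 2-bit domain id, a serial part) and all names are hypothetical simplifications; Eo's real eoid also carries other bits (e.g. the super and ref bits mentioned above) and is laid out differently.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 64-bit object-id layout, NOT Eo's real one:
 *   bit 63      : "is a class" flag
 *   bits 61..62 : domain id (2 bits -> 4 domains)
 *   bits 0..60  : serial / id part
 * Each field gets its own bits, so a large serial can never
 * spill into the class flag. */
#define OID_CLASS_SHIFT  63
#define OID_DOMAIN_SHIFT 61
#define OID_CLASS_MASK   ((uint64_t)1 << OID_CLASS_SHIFT)
#define OID_DOMAIN_MASK  ((uint64_t)3 << OID_DOMAIN_SHIFT)
#define OID_SERIAL_MASK  (((uint64_t)1 << OID_DOMAIN_SHIFT) - 1)

/* Compile-time guards against a bit being shared by two fields. */
_Static_assert((OID_CLASS_MASK & OID_DOMAIN_MASK) == 0, "class/domain overlap");
_Static_assert((OID_CLASS_MASK & OID_SERIAL_MASK) == 0, "class/serial overlap");
_Static_assert((OID_DOMAIN_MASK & OID_SERIAL_MASK) == 0, "domain/serial overlap");
_Static_assert((OID_CLASS_MASK | OID_DOMAIN_MASK | OID_SERIAL_MASK) == UINT64_MAX,
               "every bit belongs to exactly one field");

static uint64_t oid_pack(int is_class, unsigned domain, uint64_t serial)
{
   assert(domain < 4);                          /* only 2 domain bits */
   assert((serial & ~OID_SERIAL_MASK) == 0);    /* serial must fit its field */
   return ((uint64_t)(is_class != 0) << OID_CLASS_SHIFT) |
          ((uint64_t)domain << OID_DOMAIN_SHIFT) |
          serial;
}

static int oid_is_class(uint64_t oid) { return (oid & OID_CLASS_MASK) != 0; }

static unsigned oid_domain(uint64_t oid)
{
   return (unsigned)((oid & OID_DOMAIN_MASK) >> OID_DOMAIN_SHIFT);
}

static uint64_t oid_serial(uint64_t oid) { return oid & OID_SERIAL_MASK; }
```

Packing the largest possible serial with the class flag cleared and checking that `oid_is_class()` still returns 0 is exactly the case that broke when one bit did double duty.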

in fact i can INSTANTLY find bad code in verne:

efm_fs_file.c. _fs_cb(). calling efreet functions FROM a thread. boom. as per efl docs. look on e.org docs and in elm docs etc. - they say clearly efl is not threadsafe EXCEPT for some specific thread related functions in ecore intended to be run there (e.g. ecore_thread_check()) and eina (in as much as it can be - i.e. modify a list in thread1 while walking it in thread2 - good luck. add locks, but any hidden/transparent global data is handled).

also technically _extract() in archive.c too -> ecore_file_file_get(). yes i know this doesn't actually cause issues.

now if eio is calling cb's from a thread that is awfully bad... unless it's clearly documented that it can/will. i see eio usage... this could be a bug in eio...
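The "not threadsafe except where documented" rule above is easier to live with when violations are loud. Below is a minimal sketch of one way a library can enforce main-thread-only entry points: remember the thread that ran init and compare `pthread_self()` on entry. All names are hypothetical; this is not how EFL or Eo implement their own checks.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>

/* Hypothetical "main loop only" guard; names are illustrative, not EFL API. */
static pthread_t demo_main_thread;

/* Must be called once, from the thread that will run the main loop. */
static void demo_init(void)
{
   demo_main_thread = pthread_self();
}

/* Returns 1 when called from the thread that ran demo_init(). */
static int demo_in_main_thread(void)
{
   return pthread_equal(pthread_self(), demo_main_thread) != 0;
}

/* A non-threadsafe API refuses (and complains) instead of racing. */
static int demo_unsafe_api(void)
{
   if (!demo_in_main_thread())
     {
        fprintf(stderr, "demo_unsafe_api() called outside the main loop thread\n");
        return -1;
     }
   /* ... touch unlocked global state here ... */
   return 0;
}

/* Worker that (wrongly) calls the main-loop-only API and reports the result. */
static void *demo_thread(void *arg)
{
   *(int *)arg = demo_unsafe_api();
   return NULL;
}
```

The design choice is to fail early and visibly at the call site rather than corrupt shared state and crash somewhere unrelated later.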

anyway @dkasak ... that is a distraction. how can adding TLS stuff make your gl DRIVER layer refuse to work... this is far removed from the eo level of things. could it be your gl driver uses threads internally and also tls (well ok gl drivers DO use tls. gl contexts are actually tls bound state)... what driver/gpu? it'd be nice to know exactly what in gl engine init fails. you can test this without e. just run any efl app like:

export ELM_ACCEL=gl
elementary_test

does it complain of engine init failure? get more logs and bt's with:

export EINA_LOG_LEVEL=4
export EINA_LOG_BACKTRACE=999

As I said above, I am talking about elm.icons which are still in the child list of an elm.widget, which are freed but whose del event never came to elementary.

Also, sure there is bad code in verne; there is also some in e, in terminology and in every other efl app. But what does it matter? ecore_file_file_get does not do anything at all with eo or anything which could fuck something up from a different thread; same for the function in elm_file_fs.c: efreet_mime_fallback_type_get does not do anything with other efl api which is not thread-independent ...

And yes, eio is calling callbacks from a thread, AND it is clearly documented. Also, you can trust me a bit that I know which code works in my filemanager and which doesn't. Also, each object in my lib is an eo object, so following the logic of "complain if accessing from outside the owning thread" you should see a huge error message on the screen which says "hey, this object does not belong to this thread". And oh see, no, it's not happening.

Thanks for the response @raster. The backtrace:

and the log without backtrace: . Something I noticed while scanning through it myself:

DBG<6169>:eina_module lib/eina/eina_module.c:284 eina_module_new() m=0x5565cf767cf0, file=/opt/efl/lib/evas/modules/engines/gl_x11/v-1.18/module.so
DBG<6169>:eina_module lib/eina/eina_module.c:312 eina_module_load() m=0x5565cf767cf0, handle=(nil), file=/opt/efl/lib/evas/modules/engines/gl_x11/v-1.18/module.so, refs=0
WRN<6169>:eina_module lib/eina/eina_module.c:328 eina_module_load() could not dlopen("/opt/efl/lib/evas/modules/engines/gl_x11/v-1.18/module.so", dlopen: cannot load any more object with static TLS): RTLD_NOW
DBG<6169>:eina_module lib/eina/eina_module.c:293 eina_module_free() m=0x5565cf767cf0, handle=(nil), file=/opt/efl/lib/evas/modules/engines/gl_x11/v-1.18/module.so, refs=0
DBG<6169>:evas_main modules/evas/engines/software_generic/evas_engine.c:5777 init_gl() Initializing Software OpenGL APIs...
DBG<6169>:evas_main modules/evas/engines/software_generic/evas_engine.c:5757 gl_lib_init() Unable to open libOSMesa:  dlopen: cannot load any more object with static TLS
DBG<6169>:evas_main modules/evas/engines/software_generic/evas_engine.c:5780 init_gl() Unable to support EvasGL in this engine module. Install OSMesa to get it running
DBG<6169>:eina_module lib/eina/eina_module.c:284 eina_module_new() m=0x5565cfb6fb00, file=/opt/efl/lib/evas/modules/engines/gl_x11/v-1.18/module.so
DBG<6169>:eina_module lib/eina/eina_module.c:312 eina_module_load() m=0x5565cfb6fb00, handle=(nil), file=/opt/efl/lib/evas/modules/engines/gl_x11/v-1.18/module.so, refs=0
WRN<6169>:eina_module lib/eina/eina_module.c:328 eina_module_load() could not dlopen("/opt/efl/lib/evas/modules/engines/gl_x11/v-1.18/module.so", dlopen: cannot load any more object with static TLS): RTLD_NOW
DBG<6169>:eina_module lib/eina/eina_module.c:293 eina_module_free() m=0x5565cfb6fb00, handle=(nil), file=/opt/efl/lib/evas/modules/engines/gl_x11/v-1.18/module.so, refs=0
DBG<6169>:eo lib/eo/eo.c:670 _eo_class_funcs_set() 0x7fe0094af3e0->0x7fe0094afc70 'elm_obj_glview_gl_api_get'
DBG<6169>:eo lib/eo/eo.c:670 _eo_class_funcs_set() 0x7fe0094af4f0->0x7fe0094af040 'elm_obj_glview_evas_gl_get'
ERR<6169>:evas_main lib/evas/canvas/evas_gl.c:151 evas_gl_new() Evas GL engine not available.
ERR<6169>:elementary lib/elementary/elm_glview.c:243 _elm_glview_constructor() Failed Creating an Evas GL Object.

I'm running on an Intel GPU, with mesa built from git:

OpenGL vendor string: Intel Open Source Technology Center
OpenGL renderer string: Mesa DRI Intel(R) HD Graphics 520 (Skylake GT2)
OpenGL core profile version string: 4.4 (Core Profile) Mesa 12.1.0-devel (git-148fbf3)

I hope this helps. Let me know if there is any other info I can gather.

@bu5hm4n this is getting off topic... BUT bad code and threads... you call efreet mime api's from a thread. efreet code is not written to be threadsafe at all, so you are asking for trouble. it happens to work. it's not related to eo, yet you do it. the issue now is your view of the world: that just because it happened to work before and now does not, it's not your problem... if you take this view with all of your code, which you seemingly do, then i can just assume it's crawling with similar bugs of your own making and i've just triggered them. at this point i stop looking, because you are simply dismissing your code entirely from any blame even though i already pointed out 2 areas directly, and your response is "so what?". sure, it's not related to icons or eo/elm, but i smell bugs waiting to happen related to thread safety. in fact efreet there WILL sooner or later crash as the mainloop does a mime db update WHILE a thread accesses the mimedb info at the same time... boom. waiting to happen. if you said "oh oops. my bad. i'll fix that" i might be inclined to look further into eio and what is and is not called from a thread and what your code may or may not be doing there. i'll leave it up to you to find a simple reproduction case. you don't use ecore_thread_main_loop_begin/end(), so that was the only vector where i thought there might be an issue. yes, maybe eo could be far more helpful in its debug output when it cannot find an eo id, e.g. give the thread id, domain id etc. but i cannot tell you for SURE that it belongs to another thread, and which one, because there is ambiguity in eoids across threads (thus the domains, which provide 4 regions for eoids to live in).

@dkasak ...

WRN<6169>:eina_module lib/eina/eina_module.c:328 eina_module_load() could not dlopen("/opt/efl/lib/evas/modules/engines/gl_x11/v-1.18/module.so", dlopen: cannot load any more object with static TLS): RTLD_NOW

oh .... poo. well well well. THAT explains the problem. why do you suffer and not i? 32bit? 64bit? i have intel mesa here - working fine. this is the key... and it isnt the tls stuff in eo @bu5hm4n ... this is __thread for the eo op id lookup cache that wasn't threadsafe at all before. i fixed that

@dkasak - can you do this. in Eo.h in the efl src tree find

  #define EFL_FUNC_TLS __thread

and just make it

  #define EFL_FUNC_TLS

then rebuild efl. does problem go away?
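For anyone wondering what that one-line change does, the sketch below mimics the pattern with a hypothetical `DEMO_FUNC_TLS` macro and a one-entry lookup cache (not Eo's real cache structure). With `__thread` every thread gets its own copy, so a worker filling its cache cannot clobber the main thread's entry; defining the macro empty turns it back into one shared, unsynchronized cache.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Same idea as the EFL_FUNC_TLS switch: __thread gives one cache per
 * thread; defining this empty gives one cache shared by all threads. */
#define DEMO_FUNC_TLS __thread

/* Hypothetical one-entry "call resolve" cache, not Eo's real structure. */
typedef struct
{
   const void *klass;  /* key: class looked up last */
   int         op;     /* value: resolved op id */
} Demo_Cache;

static DEMO_FUNC_TLS Demo_Cache demo_cache;

static int demo_resolve(const void *klass, int op_for_klass)
{
   if (demo_cache.klass == klass) return demo_cache.op;  /* cache hit */
   /* cache miss: a real implementation would do the expensive lookup here */
   demo_cache.klass = klass;
   demo_cache.op = op_for_klass;
   return op_for_klass;
}

static char klass_a, klass_b;  /* stand-in class handles (only addresses matter) */

static void *worker(void *arg)
{
   (void)arg;
   demo_resolve(&klass_b, 2);          /* worker caches class B */
   return (void *)demo_cache.klass;    /* report what this thread now sees */
}

/* Returns 1 if the main thread's cache survived the worker's writes. */
static int demo_cache_is_per_thread(void)
{
   pthread_t t;
   void *seen_by_worker = NULL;

   demo_resolve(&klass_a, 1);          /* main thread caches class A */
   if (pthread_create(&t, NULL, worker, NULL) != 0) return -1;
   pthread_join(t, &seen_by_worker);
   assert(seen_by_worker == &klass_b);
   return demo_cache.klass == &klass_a;
}
```

Switching the macro to an empty `#define DEMO_FUNC_TLS` makes `demo_cache_is_per_thread()` return 0, because the worker then overwrites the single shared entry, with no lock around it.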

some googling:

http://stackoverflow.com/questions/14892101/cannot-load-any-more-object-with-static-tls

fyi i'm on holiday atm traveling, but the issue i see here is something i'll have to chase up and i'm being yelled at to get up and do tourist things. can't now.

so some looking - there might be one .so that is loaded that is not -fPIC and that may be the root cause. perhaps doing:

readelf -d `find /usr -name '*.so'` | grep TEXT

? some .so is not -fPIC on i386 and that is the root cause ... perhaps. finding it is the issue. maybe also:

readelf --relocs FILE.so | egrep '(GOT|PLT|JU?MP_SLOT)'

? if the above is empty for a .so ... bad... in fact...

for I in `find /usr -name '*.so'`; do echo -n "$I   "; readelf --relocs $I | egrep '(GOT|PLT|JU?MP_SLOT)' | wc -l; done | grep " 0$"

... i see some possibly bad results:

/usr/lib32/libGLESv2.so   0
/usr/lib32/mesa/libGLESv2.so   0
/usr/lib32/mesa/libGLESv1_CM.so   0
/usr/lib32/libGLESv1_CM.so   0

but many others too... this may or may not actually be an issue. as per http://stackoverflow.com/questions/1340402/how-can-i-tell-with-something-like-objdump-if-an-object-file-has-been-built-wi but i'm just starting to look. but as above... some .so somewhere has maybe linked in non-PIC code and this causes an issue. perhaps. the __thread thing if you disable it should make things work again but those .so's should be -fPIC.

i could have tried a far more involved solution using eina_tls but this would move the cost from linktime to runtime... :(. so i think it's not the eoid tls stuff at all but the threadsafety fixes that went along with it. let me know if removing __thread as per previous message helps. i suspect it does.


My response is not "so what"; my response was "this api is completely independent from thread stuff". It's the fallback handler, which just checks the file content. This is why I am not caring about this api.

Now it's just getting more and more off-topic, but why should the fallback handler, which just checks the first 32 bits of a file, ever trigger the mimedb? I am just calling the fallback stuff there because I have checked the code there, and even if the fs dies, this will not take down the complete app. The fallback is set first; after that the other get methods are called, in an idler in the ml (mainloop).

Then back to the topic: it is an elm.icon which gets created from the callback of a genlist, which is ... in the ml ..., so this is not even related to eio or my usage of it. Even more, the icons which are causing the trouble are elm.icons with normal standard icons, not at all related to eio or anything like that...

Also, if we go the path of "never use anything of e* api outside the thread, even if it's possible after checking the code", then even enlightenment is asking for huge huge trouble...

raster added a comment. Edited Sep 16 2016, 8:31 AM

Have you updated efl? I assume not, as verne no longer compiles. The eoid tls change pointed out a bug in the efl.ui.image thread preload code that uses threads. Cedric "fixed" it last week, then i spotted it and refixed it to be far better and cleaner and less blocking. This was about the same time efl_self changed. Verne didn't work with these changes when i tried earlier today or last night. Don't remember. I'm on my phone outside the Vatican atm. But the eoid changes FOUND A BUG in efl. It got fixed already last week. I was assuming you're up to date but maybe you aren't? The point is back to my original one: the eoid changes are good for code QA because they fail early rather than much later at random.

You still have 2 bugs i found. They are bugs. Yes, efreet magic checks can be an issue for you with threads. Efreet loads the magic mime files directly into global variables and they are not mmapped from the db atm. If they were mmapped you'd have a bug too. You'd have a bug just by definition, regardless of the internal code in efreet. If your thread is looking up the hash or list inside the efreet api while the mainloop is doing a reload of the mime magic files, then boom. Crash. The data isn't locked to stop this. I'm telling you it is defined not to be safe. By docs. It is also provably not safe.

ecore_file is also defined not to be safe. that is the assumption with efl code everywhere EXCEPT where i mentioned: Eina within reason, and the ecore_thread calls for check and feedback, main loop begin/end, and sync/async call and wait.

This is the assumption of all code inside efl and it's the base premise for any changes and optimizations in future. Those are the limitations we accepted long ago, before threading was useful or well supported. When we work to make things more thread friendly it'll be very careful, and that is exactly why eoid is doing the tls thing. It's being careful. There is a special shared domain all threads can use for shared objects, BUT the object must support it and atm there is no way to know if it does, so by definition none do beyond the eo base class. There is a fairly fat cost to accessing a shared object anyway, so it'd be very smart to limit its use a lot.

We have zero objects atm that can exist outside the mainloop and until we do... it's a limitation we all have to live with. Node.js had this limit and people build powerful things with it all the time. It's workable. But we'll delimit things as we go on. We're designing in the ability to do so cleanly and safely.
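The mime-magic race described above (the main loop reloading global data while a worker thread reads it) can be sketched in a few lines. This is a hypothetical stand-in table, not efreet's code, and it shows one conventional fix: build the new data first, swap the pointer under a pthread rwlock writer lock, and have readers hold the reader lock for as long as they touch the data. Without the lock, a reader could still be dereferencing `old` at the moment the reloader frees it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical global "magic" data reloaded by the main loop (not efreet code). */
typedef struct { char *type; } Demo_Table;

static Demo_Table *demo_table;                         /* shared global */
static pthread_rwlock_t demo_lock = PTHREAD_RWLOCK_INITIALIZER;

/* Main loop side: build the new data first, swap it in under the write
 * lock, and free the old copy only afterwards. Acquiring the write lock
 * waits for all readers, so none can still be using the old copy. */
static void demo_table_reload(const char *type)
{
   Demo_Table *fresh = malloc(sizeof(*fresh));
   Demo_Table *old;

   assert(fresh != NULL);
   fresh->type = strdup(type);
   assert(fresh->type != NULL);

   pthread_rwlock_wrlock(&demo_lock);
   old = demo_table;
   demo_table = fresh;
   pthread_rwlock_unlock(&demo_lock);

   if (old)
     {
        free(old->type);
        free(old);
     }
}

/* Worker side: hold the read lock for the whole time the data is used,
 * so a concurrent reload cannot free it mid-lookup. */
static int demo_table_matches(const char *type)
{
   int match;

   pthread_rwlock_rdlock(&demo_lock);
   match = demo_table != NULL && strcmp(demo_table->type, type) == 0;
   pthread_rwlock_unlock(&demo_lock);
   return match;
}
```

This is only an illustration of why "it happens to work" is not the same as safe: the unlocked version has the same shape and usually works, until a reload lands in the middle of a lookup.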

I just pushed the changes (i usually keep build breaks as long as i have a working state with the new api, so i don't get a shitload of bug tickets), and yes, i am up to date. It's possible that the commit discovered a bug in the event emission in eo; i am still debugging. The problem is that the object gets deleted, but somehow the EFL_EVENT_DEL does not get delivered (at least not to the end of elm.widget).

And i can start again from the beginning: I checked it myself, read the code, this is why i am doing it. But if you don't believe me, comment out the usage - it doesn't change anything. I came to a quite usable state of debugging and have the code flow from below; i just don't know what it tells me. Fact is that elm.widget does not get called. But the object is never deleted from the subobj list, and thus the callback is never deleted.

If you want i can share the debugging commit with you later, which i am using to track the issue right now ...

Anyway, no hurry on that, i am also debugging it!

Have a nice vacation! :)

I can't really look atm. It'll have to wait, but following the efl_del or unref on the icon would be a start. I wanted to look at the issue but verne didn't build, and all i could do sanely was to review your code. Thus the finding of the first issues. It is possible that eo has bugs but i haven't seen them anywhere in any tests or usage. Yes, i wrote code that used elm image with async preload on. I can share the test if you want but it's not on my phone.

Okay, NOW it's REALLY pushed. i did git push in the wrong repo *hits myself on the back of the head*

There is no need to rush on this, i just want to tell people that there are issues ...

In T4611#70588, @raster wrote:

@dkasak - can you do this. in Eo.h in the efl src tree find

  #define EFL_FUNC_TLS __thread

and just make it

  #define EFL_FUNC_TLS

then rebuild efl. does problem go away?

Yes with that change I can rebuild current efl and GL compositing still works.

ok. confirmed then... it has something to do with 32bit, -fPIC and __thread ... or reaaaaaaaaaly close to those. now ... just what is the offending thing? is there a non -fPIC .so linked/loaded? if so... which one is it? don't tell me it's something inside opengl... :( we can't do much about existing .so's and being non-PIC.

re: 32-bit, I'm not sure why this was mentioned ... I'm on a 64bit Sabayon installation ( multilib ), and AFAIK I'm building 64bit binaries:

dkasak@lenin /opt/efl/bin $ ldd enlightenment
	linux-vdso.so.1 (0x00007fff0a7dc000)
	libelementary.so.1 => /opt/efl/lib/libelementary.so.1 (0x00007feadf1a6000)
	libelocation.so.1 => /opt/efl/lib/libelocation.so.1 (0x00007feadef9d000)
	libethumb_client.so.1 => /opt/efl/lib/libethumb_client.so.1 (0x00007feaded91000)
	libethumb.so.1 => /opt/efl/lib/libethumb.so.1 (0x00007feadeb82000)
	libedje.so.1 => /opt/efl/lib/libedje.so.1 (0x00007feade85f000)
	libecore_audio.so.1 => /opt/efl/lib/libecore_audio.so.1 (0x00007feade650000)
	libembryo.so.1 => /opt/efl/lib/libembryo.so.1 (0x00007feade441000)
	libecore_imf_evas.so.1 => /opt/efl/lib/libecore_imf_evas.so.1 (0x00007feade23d000)
	libecore_evas.so.1 => /opt/efl/lib/libecore_evas.so.1 (0x00007feade010000)
	libecore_input_evas.so.1 => /opt/efl/lib/libecore_input_evas.so.1 (0x00007feadde09000)
	libecore_imf.so.1 => /opt/efl/lib/libecore_imf.so.1 (0x00007feaddbfe000)
	libefreet_trash.so.1 => /opt/efl/lib/libefreet_trash.so.1 (0x00007feadd9fa000)
	libemotion.so.1 => /opt/efl/lib/libemotion.so.1 (0x00007feadd7d5000)
	libeio.so.1 => /opt/efl/lib/libeio.so.1 (0x00007feadd5b2000)
	libefreet_mime.so.1 => /opt/efl/lib/libefreet_mime.so.1 (0x00007feadd3ab000)
	libefreet.so.1 => /opt/efl/lib/libefreet.so.1 (0x00007feadd181000)
	libecore_ipc.so.1 => /opt/efl/lib/libecore_ipc.so.1 (0x00007feadcf78000)
	libevas.so.1 => /opt/efl/lib/libevas.so.1 (0x00007feadca7c000)
	libharfbuzz.so.0 => /usr/lib64/libharfbuzz.so.0 (0x00007feadc7fb000)
	libfribidi.so.0 => /usr/lib64/libfribidi.so.0 (0x00007feadc5e3000)
	libfontconfig.so.1 => /usr/lib64/libfontconfig.so.1 (0x00007feadc3a0000)
	libfreetype.so.6 => /usr/lib64/libfreetype.so.6 (0x00007feadc0d8000)
	libluajit-5.1.so.2 => /usr/lib64/libluajit-5.1.so.2 (0x00007feadbe67000)
	libector.so.1 => /opt/efl/lib/libector.so.1 (0x00007feadbc32000)
	libpng16.so.16 => /usr/lib64/libpng16.so.16 (0x00007feadb9fe000)
	libecore_x.so.1 => /opt/efl/lib/libecore_x.so.1 (0x00007feadb794000)
	libXcursor.so.1 => /usr/lib64/libXcursor.so.1 (0x00007feadb589000)
	libX11.so.6 => /usr/lib64/libX11.so.6 (0x00007feadb24c000)
	libXcomposite.so.1 => /usr/lib64/libXcomposite.so.1 (0x00007feadb049000)
	libXdamage.so.1 => /usr/lib64/libXdamage.so.1 (0x00007feadae46000)
	libXext.so.6 => /usr/lib64/libXext.so.6 (0x00007feadac33000)
	libXfixes.so.3 => /usr/lib64/libXfixes.so.3 (0x00007feadaa2c000)
	libXinerama.so.1 => /usr/lib64/libXinerama.so.1 (0x00007feada829000)
	libXrandr.so.2 => /usr/lib64/libXrandr.so.2 (0x00007feada61d000)
	libXrender.so.1 => /usr/lib64/libXrender.so.1 (0x00007feada412000)
	libXtst.so.6 => /usr/lib64/libXtst.so.6 (0x00007feada20c000)
	libXss.so.1 => /usr/lib64/libXss.so.1 (0x00007feada008000)
	libXi.so.6 => /usr/lib64/libXi.so.6 (0x00007fead9df8000)
	libecore_wl2.so.1 => /opt/efl/lib/libecore_wl2.so.1 (0x00007fead9bde000)
	libwayland-cursor.so.0 => /usr/lib64/libwayland-cursor.so.0 (0x00007fead99d6000)
	libwayland-server.so.0 => /usr/lib64/libwayland-server.so.0 (0x00007fead97c0000)
	libwayland-client.so.0 => /usr/lib64/libwayland-client.so.0 (0x00007fead95af000)
	libuuid.so.1 => /lib64/libuuid.so.1 (0x00007fead93aa000)
	libecore_drm2.so.1 => /opt/efl/lib/libecore_drm2.so.1 (0x00007fead919d000)
	libdrm.so.2 => /usr/lib64/libdrm.so.2 (0x00007fead8f8e000)
	libgbm.so.1 => /usr/lib64/libgbm.so.1 (0x00007fead8d7f000)
	libelput.so.1 => /opt/efl/lib/libelput.so.1 (0x00007fead8b71000)
	libinput.so.10 => /usr/lib64/libinput.so.10 (0x00007fead8945000)
	libxkbcommon.so.0 => /usr/lib64/libxkbcommon.so.0 (0x00007fead8702000)
	libeldbus.so.1 => /opt/efl/lib/libeldbus.so.1 (0x00007fead84be000)
	libdbus-1.so.3 => /usr/lib64/libdbus-1.so.3 (0x00007fead826a000)
	libecore_input.so.1 => /opt/efl/lib/libecore_input.so.1 (0x00007fead8056000)
	libeeze.so.1 => /opt/efl/lib/libeeze.so.1 (0x00007fead7e43000)
	libmount.so.1 => /lib64/libmount.so.1 (0x00007fead7bf3000)
	libudev.so.1 => /usr/lib64/libudev.so.1 (0x00007feadf934000)
	libecore_file.so.1 => /opt/efl/lib/libecore_file.so.1 (0x00007fead79ea000)
	libecore_con.so.1 => /opt/efl/lib/libecore_con.so.1 (0x00007fead7794000)
	libeet.so.1 => /opt/efl/lib/libeet.so.1 (0x00007fead755a000)
	libemile.so.1 => /opt/efl/lib/libemile.so.1 (0x00007fead7348000)
	libssl.so.1.0.0 => /usr/lib64/libssl.so.1.0.0 (0x00007fead70d5000)
	libcrypto.so.1.0.0 => /usr/lib64/libcrypto.so.1.0.0 (0x00007fead6c97000)
	libz.so.1 => /lib64/libz.so.1 (0x00007fead6a81000)
	liblz4.so.1 => /usr/lib64/liblz4.so.1 (0x00007fead686e000)
	libjpeg.so.62 => /usr/lib64/libjpeg.so.62 (0x00007fead6604000)
	libecore.so.1 => /opt/efl/lib/libecore.so.1 (0x00007fead63a9000)
	libgthread-2.0.so.0 => /usr/lib64/libgthread-2.0.so.0 (0x00007fead61a7000)
	libglib-2.0.so.0 => /usr/lib64/libglib-2.0.so.0 (0x00007fead5e98000)
	libefl.so.1 => /opt/efl/lib/libefl.so.1 (0x00007fead5c54000)
	libeo.so.1 => /opt/efl/lib/libeo.so.1 (0x00007fead5a35000)
	libeina.so.1 => /opt/efl/lib/libeina.so.1 (0x00007fead579f000)
	libsystemd.so.0 => /usr/lib64/libsystemd.so.0 (0x00007feadf89e000)
	libdl.so.2 => /lib64/libdl.so.2 (0x00007fead559b000)
	libunwind-x86_64.so.8 => /usr/lib64/libunwind-x86_64.so.8 (0x00007fead537c000)
	libunwind.so.8 => /usr/lib64/libunwind.so.8 (0x00007fead5161000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fead4f44000)
	libEGL.so.1 => /usr/lib64/libEGL.so.1 (0x00007fead4d0e000)
	libm.so.6 => /lib64/libm.so.6 (0x00007fead4a0a000)
	librt.so.1 => /lib64/librt.so.1 (0x00007fead4802000)
	libpam.so.0 => /lib64/libpam.so.0 (0x00007fead45f3000)
	libc.so.6 => /lib64/libc.so.6 (0x00007fead4243000)
	libstdc++.so.6 => /usr/lib/gcc/x86_64-pc-linux-gnu/4.9.3/libstdc++.so.6 (0x00007fead3f15000)
	/lib64/ld-linux-x86-64.so.2 (0x00007feadf7a8000)
	libgcc_s.so.1 => /usr/lib/gcc/x86_64-pc-linux-gnu/4.9.3/libgcc_s.so.1 (0x00007fead3cfe000)
	libgraphite2.so.3 => /usr/lib64/libgraphite2.so.3 (0x00007fead3ad0000)
	libexpat.so.1 => /usr/lib64/libexpat.so.1 (0x00007fead38a6000)
	libbz2.so.1 => /lib64/libbz2.so.1 (0x00007fead3695000)
	libxcb.so.1 => /usr/lib64/libxcb.so.1 (0x00007fead346c000)
	libffi.so.6 => /usr/lib64/libffi.so.6 (0x00007fead3263000)
	libmtdev.so.1 => /usr/lib64/libmtdev.so.1 (0x00007fead305d000)
	libevdev.so.2 => /usr/lib64/libevdev.so.2 (0x00007fead2e43000)
	libblkid.so.1 => /lib64/libblkid.so.1 (0x00007fead2bfb000)
	libcap.so.2 => /lib64/libcap.so.2 (0x00007fead29f5000)
	libkrb5.so.3 => /usr/lib64/libkrb5.so.3 (0x00007fead270a000)
	libk5crypto.so.3 => /usr/lib64/libk5crypto.so.3 (0x00007fead24d7000)
	libpcre.so.1 => /lib64/libpcre.so.1 (0x00007fead2268000)
	liblzma.so.5 => /lib64/liblzma.so.5 (0x00007fead203e000)
	libX11-xcb.so.1 => /usr/lib64/libX11-xcb.so.1 (0x00007fead1e3c000)
	libxcb-dri2.so.0 => /usr/lib64/libxcb-dri2.so.0 (0x00007fead1c36000)
	libxcb-dri3.so.0 => /usr/lib64/libxcb-dri3.so.0 (0x00007fead1a33000)
	libxcb-present.so.0 => /usr/lib64/libxcb-present.so.0 (0x00007fead1830000)
	libxcb-xfixes.so.0 => /usr/lib64/libxcb-xfixes.so.0 (0x00007fead1627000)
	libxcb-sync.so.1 => /usr/lib64/libxcb-sync.so.1 (0x00007fead141f000)
	libxshmfence.so.1 => /usr/lib64/libxshmfence.so.1 (0x00007fead121c000)
	libXau.so.6 => /usr/lib64/libXau.so.6 (0x00007fead1018000)
	libXdmcp.so.6 => /usr/lib64/libXdmcp.so.6 (0x00007fead0e12000)
	libcom_err.so.2 => /lib64/libcom_err.so.2 (0x00007fead0c0e000)
	libkrb5support.so.0 => /usr/lib64/libkrb5support.so.0 (0x00007fead0a00000)
	libkeyutils.so.1 => /lib64/libkeyutils.so.1 (0x00007fead07fc000)
	libresolv.so.2 => /lib64/libresolv.so.2 (0x00007fead05e4000)
	libbsd.so.0 => /usr/lib64/libbsd.so.0 (0x00007fead03ce000)

really? on 64bit? hmm ok - then why is this happening? as per backlog this can happen on 32bit systems, but 64bit can't link non-PIC code in with PIC, so it'd fail entirely. the reason i say this is because this seems to be the only reason mentioned that this would happen:

http://stackoverflow.com/questions/14892101/cannot-load-any-more-object-with-static-tls

explaining "dlopen: cannot load any more object with static TLS" ... what object has static TLS? ... ? googling more now:

https://sourceware.org/ml/libc-alpha/2014-10/msg00134.html

https://bugzilla.redhat.com/show_bug.cgi?id=1124987

http://stackoverflow.com/questions/19268293/matlab-error-cannot-open-with-static-tls

https://lists.launchpad.net/ubuntu-phone/msg09509.html

basically there is an issue and your libc simply may not have enough static tls descriptor slots. :(

as another try can you add this flag to your CFLAGS and rebuild efl?

-ftls-model=global-dynamic

that one? just try... but a PIC binary SHOULD have that by default...
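For reference, `-ftls-model=` has a per-variable counterpart, the GCC/Clang `tls_model` attribute. The sketch below declares one variable with each of the two models discussed here. As far as we understand glibc, an object that uses the initial-exec (static TLS) model and is then loaded with dlopen() needs room in the static TLS area, and older glibc also caps how many such late-loaded objects it accepts (the DTV_SURPLUS constant patched further down); global-dynamic goes through `__tls_get_addr` instead. Which object in this report actually uses static TLS is not established by this sketch. Both models still give each thread its own copy, which the small check confirms.

```c
#include <assert.h>
#include <pthread.h>

/* Per-variable equivalents of -ftls-model=global-dynamic / initial-exec. */
static __thread int counter_gd __attribute__((tls_model("global-dynamic")));
static __thread int counter_ie __attribute__((tls_model("initial-exec")));

/* Runs in a second thread: modifies that thread's own copies only. */
static void *bump(void *arg)
{
   (void)arg;
   counter_gd += 10;
   counter_ie += 10;
   return NULL;
}

/* Both models are per-thread; only the generated access code and the way
 * the loader reserves space differ. Returns 1 if the main thread's values
 * were untouched by the other thread. */
static int tls_models_are_per_thread(void)
{
   pthread_t t;

   counter_gd = 1;
   counter_ie = 1;
   if (pthread_create(&t, NULL, bump, NULL) != 0) return -1;
   pthread_join(t, NULL);
   return counter_gd == 1 && counter_ie == 1;
}
```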

raster added a subscriber: cedric. Sep 20 2016, 6:28 AM
In T4611#70954, @raster wrote:

as another try can you add this flag to your CFLAGS and rebuild efl?

-ftls-model=global-dynamic

that one? just try... but a PIC binary SHOULD have that by default...

No that doesn't help.

Thanks for providing those links to read. It does sound like an issue with my glibc now that I've read up more on it. It seems like I have some options. One would be to rebuild xorg-server, mesa, and my intel drivers *without* NPTL support ... which is certainly feasible. The other is to upgrade to glibc-2.22, which ( according to one of the links you provided ) will "fix" the issue. A third option is to rebuild the same version of glibc I have ( 2.21 ) with the patch to increase the number of static TLS descriptor slots. Sabayon still ships an older version, but I've done crazy things like update my glibc with later Gentoo versions before.

Question: what effect does the above hack to Eo.h have? Can I continue to do this for the foreseeable future? If not, I guess the least invasive way forward for me is to rebuild my *current* glibc as per above.

I'm not sure if there is anything left to do from EFL's perspective. I would assume there would be quite a few installations out there that still have glibc-2.21 or older - Sabayon tends to be pretty bleeding-edge. I'm happy to help provide more info on my setup, test hacks, etc. Otherwise, feel free to close this bug if you think it's primarily my installation that is to blame.

Thanks for your help :)

ok. well you're on an older glibc than me (2.24 for me), likely why i don't see it and you do. perhaps a lot of people have upgraded or have patched glibc's and you don't. not sure. but this issue is really a thorny one. one thing we can do is reduce the # of .so's - that for us unfortunately means merging our libraries. @cedric has plans to merge eina, eo and ecore already. this will reduce TLS slots needed, but we'd likely have to reduce a lot more than that.

we could do actual pthread TLS (getspecific) via eina_tls ... but this moves the cost to runtime inside every api call, and that kind of defeats the whole purpose of these being link-time (+thread spawn etc.) costs. i can't really not use __thread there without a whole bunch of other costly overhead.

as for the Eo.h hack? it makes eo "not threadsafe". it means the eo call resolve cache is not threadsafe. at least for now you are unlikely to see many issues, but over time the issues will pile up and you'll get "weird stuff" happening and it'll be hard for anyone to know what/why.

but knowing glibc 2.21 is an issue is useful.

so our options are:

  • use eina_tls BUT this means far far far bigger runtime cost AND more code to execute (have to allocate struct, free it on thread exit, do checks for the key, init if not initted and do the alloc etc.)
  • make eo not threadsafe (this isn't really an option)
  • redesign the cache somehow to work differently without affecting performance or memory footprint or complexity much (this is going to be HARD - can't currently think of a good direction to go)
  • reduce number of .so's we need/load that have TLS segments for __thread (this is the most viable as it is actually our plan to reduce # of .so's and merge them into fewer ones). this is why i added @cedric on this as i think expanding our .so merge might be a good idea just to fix this issue here for older glibc's

merging .so's is not as easy as it sounds though... :/
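To make the cost of the eina_tls option concrete, here is a minimal sketch using raw POSIX thread keys (assumed to be roughly what eina_tls wraps on POSIX systems; the struct and function names are hypothetical). It walks through the steps listed in the first option: lazily create the key, look it up on every call, allocate and register a per-thread struct on first use, and free it from the key destructor at thread exit. A `__thread` variable needs none of these steps at the call site.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical per-thread cache, the same shape a __thread variable would hold. */
typedef struct { const void *klass; int op; } Key_Cache;

static pthread_key_t cache_key;
static pthread_once_t cache_once = PTHREAD_ONCE_INIT;

/* Key destructor: the "free it on thread exit" step. */
static void cache_free(void *p) { free(p); }

static void cache_key_init(void)
{
   int r = pthread_key_create(&cache_key, cache_free);
   assert(r == 0);
   (void)r;
}

/* Every call pays for the once-check, the key lookup and a NULL test,
 * plus an allocation the first time each thread gets here. */
static Key_Cache *cache_get(void)
{
   Key_Cache *c;

   pthread_once(&cache_once, cache_key_init);
   c = pthread_getspecific(cache_key);
   if (!c)
     {
        c = calloc(1, sizeof(*c));
        if (!c) return NULL;
        if (pthread_setspecific(cache_key, c) != 0)
          {
             free(c);
             return NULL;
          }
     }
   return c;
}

/* Reports whether a second thread got its own, freshly zeroed block. */
static void *key_worker(void *arg)
{
   Key_Cache *c = cache_get();
   int *saw_fresh = arg;

   *saw_fresh = (c != NULL && c->op == 0);
   return NULL;
}
```

Functionally this matches `__thread`; the difference is that all of the above runs inside every cached lookup, which is the runtime cost raster wants to avoid.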

FYI, rebuilding glibc-2.21 with this patch also fixes things for me:

--- a/sysdeps/generic/ldsodefs.h	2015-02-06 17:40:18.000000000 +1100
+++ b/sysdeps/generic/ldsodefs.h	2016-09-26 13:17:14.228874529 +1000
@@ -389,7 +389,7 @@
 #define TLS_SLOTINFO_SURPLUS (62)

 /* Number of additional slots in the dtv allocated.  */
-#define DTV_SURPLUS	(14)
+#define DTV_SURPLUS	(32)

   /* Initial dtv of the main thread, not allocated with normal malloc.  */
   EXTERN void *_dl_initial_dtv;

This patch was suggested in one of the above links.

thanks! i'm leaving this open mostly because we've been discussing merging efl libs. right now a quick check of terminology and enlightenment shows that these link to 35 different efl shared libs, not including modules and so on that are dlopen()ed. each lib is basically going to be eating up a slot i think. or if it has any __thread usage it would.

we're talking about reducing these 35 to somewhere between 1-6 or so (like 3-4 in the end, but 1 is theoretically possible). when we merge, this problem hopefully gets reduced... and we sneak under the bar... but until then this will continue to be an issue until libc's are "fixed" as you describe.
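To see how many loaded objects actually carry TLS (and so would consume one of those slots), an app can walk its own loaded objects with glibc's `dl_iterate_phdr()` and count the ones with a `PT_TLS` program header. This is a minimal diagnostic sketch: the function names are hypothetical, and it counts objects with any TLS segment, not only those using the static TLS model.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <link.h>
#include <stdio.h>

/* A TLS variable here, so the program itself has a PT_TLS segment. */
__thread int tls_probe;

/* dl_iterate_phdr() callback: invoked once per loaded object. */
static int count_tls_cb(struct dl_phdr_info *info, size_t size, void *data)
{
   int *count = data;

   (void)size;
   for (ElfW(Half) i = 0; i < info->dlpi_phnum; i++)
     {
        if (info->dlpi_phdr[i].p_type == PT_TLS)
          {
             printf("TLS: %s\n",
                    info->dlpi_name[0] ? info->dlpi_name : "(main program)");
             (*count)++;
             break;
          }
     }
   return 0;  /* 0 = keep iterating */
}

/* Number of currently loaded objects that have a PT_TLS segment. */
static int count_tls_objects(void)
{
   int count = 0;

   dl_iterate_phdr(count_tls_cb, &count);
   return count;
}
```

Calling something like `count_tls_objects()` just before the evas engine module is loaded would show how many TLS-carrying objects are already present at that point.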

raster renamed this task from GL compositing on X11 broken to EFL EO thread safety with __thread causes linker issues e.g. evas gl module loading. Dec 25 2016, 8:34 PM
stefan_schmidt triaged this task as High priority. Feb 10 2017, 6:48 AM
This comment was removed by raster.
zmike edited projects, added Restricted Project; removed efl. Jun 11 2018, 6:52 AM
bu5hm4n edited projects, added efl: rendering, efl: data types; removed Restricted Project. Jun 11 2018, 7:32 AM
zmike edited projects, added Restricted Project; removed efl (efl-1.21). Jun 20 2018, 9:11 AM
zmike lowered the priority of this task from High to TODO.