Page MenuHomePhabricator

ecore_exe_posix: add timeout exception to avoid deadlock.
Needs ReviewPublic

Authored by eagleeye on Aug 12 2020, 1:54 AM.

Details

Summary

Sometimes fork + exec makes a deadlock, so I add timeout exception handling.

Diff Detail

Repository
rEFL core/efl
Branch
master
Lint
No Linters Available
Unit
No Unit Test Coverage
Build Status
Buildable 17246
Build 11510: arc lint + arc unit
eagleeye created this revision.Aug 12 2020, 1:54 AM
eagleeye requested review of this revision.Aug 12 2020, 1:54 AM
eagleeye added a subscriber: myoungwoon.

how does it deadlock? it never should... either the fork or exec fail and thus the fd we read on will become invalid an read returns... ?

eagleeye added a comment.EditedAug 12 2020, 10:43 PM

how does it deadlock? it never should... either the fork or exec fail and thus the fd we read on will become invalid an read returns... ?

I think child process can't excute exec so statusPipe can't be closed.
but it is reproduced by automation test I can't reproduce it so this is just my guess.

It seems to be similar to this case.
(https://stackoverflow.com/questions/6078712/is-it-safe-to-fork-from-within-a-thread, https://bugzilla.gnome.org/show_bug.cgi?id=738620)

If it can't exec... it'd exit. Like:

if (ok)
{
   /* Setup the status pipe. */
    E_NO_ERRNO(result, close(statusPipe[0]), ok);
    E_IF_NO_ERRNO(result, eina_file_close_on_exec(statusPipe[1], EINA_TRUE), ok) /* close on exec shows success */
    {
       int except[2] = { 0, -1 };

       except[0] = statusPipe[1];
       eina_file_close_from(3, except);
       /* Run the actual command. */
        _ecore_exe_exec_it(exe_cmd, flags); /* no return */
    }
}
_exit(-1);

So if the exec fails... it'll exit. the exit should close the fd on the child fork end as part of process termination. it even calls _exit to avoid atexit shutdown calls that may hang. in fact the fd is closed on exec - so if exec. it sets that fd to be closed on exec so ... i really want to know how this deadlocks as it smells of some kind of more core problem somewhere and this timeout is hiding it. it should just never be needed... :| if this stalls is it possible for you to get into the test system and find out the state of the stalled child process?

BTW - this fork+exec is only done from the mainloop so it doesn't suffer from the threads problem. don't think its a deadlock with a pthread_atfork as the fork would then hang in the parent and never get to the read of the status pipe...

I really would like yo understand exactly what the problem is before putting in a workaround because this is something that needs to be documented here for sure.

Undependend from whatever the reason for this deadlock is:

Whenever the read fails, this for-loop will never be exited

  1. read returns -1
  2. result is -1, ok is 0
  3. for loop continues,
  4. goto 1

I guess the for loop is missing a simple ok in the condition ?

@eagleeye would you mind giving that a try ?

@bu5hm4n indeed - i didnt even notice that. maybe try that first?

Mhm, there is more: if a application has 2 threads running, and the other thread is in a critical section in malloc when we are calling fork, eina_file_close_from will never come to call realloc, which ultimatly deadlocks the child process, deadlocking the parent process on the read call. (That is also something the links given by @eagleeye are suggesting)

raster added a comment.EditedAug 13 2020, 3:28 AM

indeed eina_file_close_from() does call realloc... perhaps this is the problem (thus asking for details of how the child is deadlocked). perhaps this needs to change to work without malloc... actually opendir will alloc too. so that is more of a core problem. but it's necessary to close fd's... so... hmmm...

this means that the core issue is the deadlock - the workaround is not the right thing to do.

i can't find any docs on how to open a dir without an alloc (opendir) in any portable way... i suspect i may have to do a complex split by opendir before fork ... that assumes readdir() doesn't alloc anything. i hope it doesnt because the buffer should already be in DIR * ... i hope...

vtorri added a subscriber: vtorri.Aug 13 2020, 12:22 PM

maybe use readdir_r()

There is alloc code in the eina_file_close_from..

alloc seems to be the cause of the deadlock, child process status is necessary to clarify the problem, isn't it?
I will try to find way to get more information about child process.

Maybe as a quickfix it would be good to *not* call eina_file_close_from at all for now we only have that since 8 months. (And additionally, add a ok into the for loop)

yes - the realloc code in eina_file_close_from can be fixed e.g. to use a fixed buffer on the stack or alloca ... but opendir() is the problem... we can't change that.

I code manually fd = open("dir", O_DIRECTORY); while (getdents(fd, &dirp, 1) > 0) { ... }; close(fd); on linux. I guess this is ok as this relies on /proc and that means this is linux specific anyway so... i just need to make this code path ifdefed for linux anyway. but this will also then be for glibc. I could syscall() directly too. The problem is also handling uclibc, musl, bionic - I need to check and double-check for BSDs though I think they dropped /proc? etc. ... so finding a portable way is going to be a challenge to avoid allocations. I can't even be sure readdir or readdir_p don't do invisible internal allocations. I am certain open and close don't. I'm pretty sure getdents does not by the look of it as it seems a thin wrapper over the syscall.

I'm leaning to a open/getdents solution here with a fallback to opendir/readdir with the caveats that the fallback may deadlock on forks... I think this is the right solution.

first stab - getting rid of heap allocs in eina_file_close_from()

P337

you mighht also want to comment out

unsetenv("NOTIFY_SOCKET");

in ecore_exe_posix.c - just for now... going ot do a bunch of work there to remove that env var from the environment without any possible heap changes but i doubt that is actually happening to you here - that env var being set so it likely is a noop

ok - new stab - no more heap stuff after fork and before exec:

P338

vtorri added a subscriber: alarcher.

and for solaris (openindiana + illumos) ? i've added @alarcher as reviewer

@eagleeye please give code from raster a try. He will land it shortly and we would like confirmation from your side that the issue is really fixed with it. You stated that you can reproduce this in automated testing. We would like to get that confirmed for the release. Thanks

4b4c208d99358941cfe886bc1a87e38c2390f0bd is the commit in upstream ... i have been running it for almost a week myself with no bad side-effects. so i don't think its worse and in theory on paper, it's a fix for this... we can revert if somehow people hit issues now that i have not seen.

I will apply your patch to product branch and test it. Thank you for your help:)