E Wizard Crash - Linux and OpenBSD (X11)
Closed, Resolved (Public)

Description

OpenBSD 6.0 amd64 git master EFL and E.

The wizard works but following completion of the wizard there is a crash:

This happens every time:

Related to T5000???

netstar created this task. Dec 14 2016, 1:02 PM

Steps to reproduce:

  1. Clean user configuration (remove .e etc)
  2. Start E from the console
  3. Complete E wizard
  4. E Crashes
netstar triaged this task as Showstopper Issues priority. Dec 14 2016, 1:09 PM
netstar added a subscriber: ManMower.

Linux too!

@ManMower tested with X11 (crash)
Wayland (no crash)

netstar renamed this task from E Wizard Crash - OpenBSD to E Wizard Crash - Linux and OpenBSD (X11). Dec 14 2016, 1:29 PM

Apparently the weird numbers in that bt are likely the result of openbsd's poisoning on free() and this is quite likely a use after free somewhere.

well this is happening on the next restart of e... right? only the next one. restart e again and it's ok? right?

i saw it too.. but once and only once...

i am beginning to wonder if this has to do with the eet changes for windows... because we mmap the cfg file. does the new e save its first config file while still mmaping the old file that it renames on top of?

@cedric ?

what i am seeing is all the module name strings are garbage.... :/

raster reassigned this task from raster to zmike. Dec 14 2016, 10:35 PM
raster added a subscriber: raster.

ok... i have this in gdb
similar to you:

#0  0x00007ff1dc8c485d in pause () from /usr/lib/libpthread.so.0
#1  <signal handler called>
#2  _module_is_nosave (name=0xfefefefe000045a3 <error: Cannot access memory at address 0xfefefefe000045a3>) at src/bin/e_module.c:146
#3  e_module_all_load () at src/bin/e_module.c:280
#4  0x00007ff1e20f04ae in eio_async_end (data=0x20dcd80, thread=<optimized out>) at lib/eio/eio_file.c:510
#5  0x00007ff1de8a4a2c in _ecore_thread_kill (work=0x20dceb0) at lib/ecore/ecore_thread.c:220
#6  _ecore_thread_handler (data=0x20dceb0) at lib/ecore/ecore_thread.c:247
#7  0x00007ff1de8836cb in _ecore_main_call_flush () at lib/ecore/ecore.c:1030
#8  0x00007ff1de8a0b61 in _ecore_pipe_handler_call (len=<optimized out>, buf=0x20fe5d0 "*", p=0x7ff1e5902060) at lib/ecore/ecore_pipe.c:511
#9  _ecore_pipe_read (data=0x7ff1e5902060, fd_handler=<optimized out>) at lib/ecore/ecore_pipe.c:637
#10 0x00007ff1de88efaa in _ecore_call_fd_cb (fd_handler=0x1d5e4c0, data=<optimized out>, func=<optimized out>) at lib/ecore/ecore_private.h:333
#11 _ecore_main_fd_handlers_call () at lib/ecore/ecore_main.c:1983
#12 0x00007ff1de88f666 in _ecore_main_loop_iterate_internal (once_only=0) at lib/ecore/ecore_main.c:2354
#13 ecore_main_loop_begin () at lib/ecore/ecore_main.c:1287
#14 0x000000000043bdcd in main (argc=<optimized out>, argv=<optimized out>) at src/bin/e_main.c:1093

in fact the whole module struct is junk:

(gdb) fr 3
(gdb) p *em
$2 = {name = 0xfefefefe000045a3 <error: Cannot access memory at address 0xfefefefe000045a3>, enabled = 180 '\264', delayed = 1 '\001', priority = 16843009}

enabled should be 0 or 1, delayed 0 or 1 and priority should be sensible like 0 or 1 or 2 or -1. that's a sign of "memory points to something junk". but odd the list node is ... ok. it has sensible content with a correct magic value...

but the eet cfg file is fine. at least if i decode it (eet -d e.cfg config e.src) and read e.src ... in fact if i manually walk the modules config list in gdb so far it seems fine... every struct is sane...

in fact i walked the list all the way to the very very very end without finding a single piece of junk:

(gdb) p *((E_Config_Module *)e_config->modules->data)
$17 = {name = 0x1e4c3ec "luncher", enabled = 1 '\001', delayed = 0 '\000', priority = 0}
(gdb) p *((E_Config_Module *)e_config->modules->next->data)
$18 = {name = 0x1f26fac "wireless", enabled = 1 '\001', delayed = 0 '\000', priority = 0}
...
(gdb) p *((E_Config_Module *)e_config->modules->next->next->next->next->next->next->next->next->next->next->next->next->next->next->next->next->next->next->next->next->next->next->next->next->next->next->next->next->next->next->next->next->next->next->next->next->data)
$54 = {name = 0x1e3ad5c "everything", enabled = 1 '\001', delayed = 1 '\001', priority = -1000}
(gdb) p *((E_Config_Module *)e_config->modules->next->next->next->next->next->next->next->next->next->next->next->next->next->next->next->next->next->next->next->next->next->next->next->next->next->next->next->next->next->next->next->next->next->next->next->next->next->data)
Cannot access memory at address 0x0

so the data struct itself is fine. false alarm on eet. i bet something is screwing up the stack.. so something called before this iteration of the loop has messed up "em" somehow... the ptr seems sane... so my guess shall be that someone has freed em before we got to it.

so we get l->next with EINA_LIST_FOREACH_SAFE() WHEN we look at the current item, safe here meaning it's safe to delete the current item from the list while you walk. my guess is we deleted l->next. in fact the config file looks like it has a bad wizard attempt to add luncher:

...
group "E_Config_Module" struct {
    value "name" string: "luncher";
    value "enabled" uchar: 1;
    value "delayed" uchar: 0;
    value "priority" int: 0;
}
group "E_Config_Module" struct {
    value "name" string: "luncher";
    value "enabled" uchar: 1;
    value "delayed" uchar: 0;
    value "priority" int: 0;
}
group "E_Config_Module" struct {
    value "name" string: "luncher";
    value "enabled" uchar: 1;
    value "delayed" uchar: 0;
    value "priority" int: 0;
}
group "E_Config_Module" struct {
    value "name" string: "luncher";
    value "enabled" uchar: 1;
    value "delayed" uchar: 0;
    value "priority" int: 0;
}
...

in fact i see 13 copies of the luncher module... 13! someone has been really insistent we load this! :-P but ok. that's one bug. we should handle that... but how can em be freed? well it could be malloc_perturb doing it (as i set it to 1 so it'd fill freed mem with 0xfe...). but why just the first 4 bytes? odd.

export MALLOC_MMAP_THRESHOLD_=4096
export MALLOC_TRIM_THRESHOLD_=0
export MALLOC_TOP_PAD_=0
export MALLOC_PERTURB_=1
export MALLOC_CHECK_=3

is what i set up to grab this sucker. let me try other perturb values. just FYI on this for now... but someone is doing something odd/wrong to the modules list/memory.

ok. luncher module load screws up the stack. first luncher module load ACTUALLY loads it. (it's not set to delay).

so i would guess the bug is in luncher. walking the modules list before loading luncher is fine. the very next list iteration after is screwed. and yes the fefefefe pattern is malloc perturb marking freed memory.
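
A minimal sketch of how a "safe" list walk can still land on a removed node, which is the shape of the guess above about EINA_LIST_FOREACH_SAFE(). This is a plain C model, not EFL code: `Node`, `LIST_FOREACH_SAFE`, `unlink_after` and `count_stale_visits` are all hypothetical names, and the node is only marked as unlinked instead of being passed to free(), so that the example stays well defined.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a list node; not the real Eina_List type. */
typedef struct Node {
    struct Node *next;
    int          id;
    int          unlinked; /* set instead of calling free(), so the demo stays defined */
} Node;

/* Same idea as a "safe" foreach: the next pointer is fetched before the body runs. */
#define LIST_FOREACH_SAFE(head, cur, nxt) \
    for ((cur) = (head), (nxt) = (cur) ? (cur)->next : NULL; \
         (cur) != NULL; \
         (cur) = (nxt), (nxt) = (cur) ? (cur)->next : NULL)

/* Unlink the node that follows `n` and mark it, modelling "someone removed l->next". */
static void unlink_after(Node *n)
{
    Node *victim;

    assert(n != NULL);
    victim = n->next;
    if (victim == NULL) return;
    n->next = victim->next;
    victim->unlinked = 1;
}

/* Walk the list; while visiting the node whose id is `trigger`, remove its successor.
 * Returns how many already-unlinked nodes the walk still visited. */
static int count_stale_visits(Node *head, int trigger)
{
    Node *cur, *nxt;
    int stale = 0;

    LIST_FOREACH_SAFE(head, cur, nxt) {
        if (cur->unlinked) stale++;                 /* cached pointer was stale */
        if (cur->id == trigger) unlink_after(cur);  /* body removes the NEXT node */
    }
    return stale;
}
```

With a three-node list and the trigger on the first node, the walk still visits the second node after it has been unlinked, because its address was cached before the loop body ran. If that node had really been freed, the next iteration would read freed memory.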

luncher bug!

raster reassigned this task from zmike to stephenmhouston. Dec 14 2016, 10:54 PM
raster added a subscriber: zmike.

YOU!

aha! now found it. it's sanity code... e_module_new() checks for duplicates in the config ... and someone put them in and it's freeing all the duplicates when we load the config module and create a new real module
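
One way to keep duplicate removal away from the load loop is to drop duplicate names in a separate pass before anything walks the list. The sketch below is only an illustration under that assumption and is not the fix that was committed: `Config_Module` mirrors the four fields visible in the gdb output, and `config_modules_dedupe` is a hypothetical helper working on a plain array rather than an Eina list.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical mirror of the fields shown in the gdb output; not the real E_Config_Module. */
typedef struct {
    const char   *name;
    unsigned char enabled;
    unsigned char delayed;
    int           priority;
} Config_Module;

/* Compact `mods` in place so each name appears once (first occurrence wins).
 * Returns the new element count. Entries are not heap-owned here, so nothing is freed. */
static size_t config_modules_dedupe(Config_Module *mods, size_t count)
{
    size_t kept = 0;

    assert(mods != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        int seen = 0;

        /* Compare against the entries already kept. */
        for (size_t j = 0; j < kept; j++) {
            if (strcmp(mods[j].name, mods[i].name) == 0) {
                seen = 1;
                break;
            }
        }
        if (!seen) mods[kept++] = mods[i];
    }
    return kept;
}
```

The comparison is quadratic in the number of entries, which is fine for a few dozen modules. The point of the design is ordering: the list is cleaned once, up front, so the later walk never sees entries disappear underneath it.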

who added all the lunchers?

oh lookie e_config.c...

i have a fix and then some here... so i've got it handled.

Aha! I've got the fix. Will commit shortly. The load of luncher shouldn't happen in the loop but instead after it.

That works \o/
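
The fix described above (doing the luncher load after the loop instead of inside it) follows a common pattern: the walk only records what it saw, and any change to the list happens once, afterwards. The actual e_config.c change is not shown in this task, so the following is a sketch under that assumption, with a hypothetical `ensure_module_once` helper operating on a plain array of names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Make sure `wanted` appears in `names`, adding it at most once.
 * `names` must have room for at least n + 1 entries. Returns the new count. */
static size_t ensure_module_once(const char **names, size_t n, const char *wanted)
{
    int present = 0;

    assert(names != NULL && wanted != NULL);

    /* Pass 1: read-only walk. Only record what was seen. */
    for (size_t i = 0; i < n; i++) {
        if (strcmp(names[i], wanted) == 0) present = 1;
    }

    /* Pass 2: modify the list after the walk, and at most once. */
    if (!present) names[n++] = wanted;
    return n;
}
```

Calling it twice with "luncher" leaves a single "luncher" entry, whereas an append placed inside the loop body could fire once per visited entry.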