
EFL perceived responsiveness is not as good as it used to be
Open, High, Public



I was wondering lately what good the whole Eo infrastructure actually brought
to EFL. I wanted to see whether there is a noticeable overhead between before
and after the introduction of Eo.

TL;DR: YES! Definitely. But I am actually not really sure why, and I am not even sure this is entirely Eo's fault.

I've made a simple benchmark with a graphical program, first linked against EFL
master (0f7c5582a4 for the record) and then against EFL 1.7 (v1.7.10 to be exact).

This test program is strongly inspired by the genlist test with autobounce.
Its (messy) code is here. A "full" genlist
occupying the whole window was required to scroll from top to bottom, and then
from bottom to top, every two seconds, for a total of 10 scrolls, with 500
items. The code was unmodified between the tests; it was just recompiled and
linked against the appropriate EFL version. The window was maximized, for a
screen resolution of 1920x1080. The following is considered:

  1. memory usage with valgrind+massif,
  2. Linux' perf, to see where each program spends most of its time,
  3. system resource usage, as retrieved by "time",
  4. output of the autobounce benchmark.

Data are consistent from one execution of the program to another, given the
same execution conditions. The tests were performed with both the OpenGL-X11
and Software-X11 renderers. Both EFL versions were built from source with -O2.

Before going through the metrics, I have to say that the perceived
responsiveness of the program is, to me, superior with EFL 1.7. I find this
consistent after manually fooling around with elementary_test.

1) Memory usage

  • EFL 1.7 (software-x11): peak at 6.3 MiB
  • EFL 1.7 (opengl-x11): peak at 33.7 MiB, nominally around 32 MiB
  • EFL master (software-x11): peak at 10.2 MiB, nominally around 8 MiB
  • EFL master (opengl-x11): peak at 36.1 MiB, nominally around 34 MiB

I observe about the same difference between EFL 1.7 and EFL master for
terminology (v0.1.0). So there is definitely a slight increase in memory
usage, but it may not be that big of a deal. Considering all the data
allocated by Eo, this seems reasonable.

2) Linux Perf

Or: where are we spending our time?


EFL 1.7:

12,91%  a.out         [.] _op_blend_p_dp_sse3
12,25%  a.out         [.] scale_rgba_in_to_out_clip_sample_internal
 3,82%  a.out         [.] _op_copy_p_dp_mmx
 3,01%  a.out         [.] _op_blend_p_dp_mmx
 1,75%  a.out    [kernel.kallsyms]         [k] clear_page_erms
 1,44%  a.out    [kernel.kallsyms]         [k] shmem_getpage_gfp
 1,38%  a.out         [.] _eina_chained_mempool_alloc_in
 1,36%  a.out         [.] evas_event_thaw
 1,28%  a.out         [.] eina_chained_mempool_free
 1,21%  a.out              [.] _int_malloc

EFL master:

6,06%  a.out          [.] _efl_object_call_resolve
6,03%  a.out        [.] _evas_event_object_list_raw_in_get.part.10
5,36%  a.out          [.] _eo_obj_pointer_get
3,36%  a.out        [.] _evas_event_object_list_raw_in_get_single.constprop.37
3,23%  a.out                [.] _dl_update_slotinfo
2,25%  a.out              [.] _int_malloc
1,63%  Eevas-thread-wk      [.] _op_blend_p_dp_mmx
1,62%  a.out              [.] _int_free
1,52%  Eevas-thread-wk  [kernel.kallsyms]       [k] copy_user_enhanced_fast_string
1,37%  a.out          [.] efl_isa
1,37%  a.out          [.] _efl_object_call_end
1,32%  a.out        [.] evas_object_recalc_clippees
1,32%  a.out        [.] __pthread_getspecific
1,22%  a.out        [.] _edje_part_recalc


EFL 1.7:

2,58%  a.out              [.] _int_malloc
2,42%  a.out         [.] evas_event_thaw
2,30%  a.out         [.] evas_object_event_callback_call
2,20%  a.out         [.] _eina_chained_mempool_alloc_in
2,19%  a.out        [.] __pthread_mutex_lock
2,03%  a.out              [.] __GI___strcmp_ssse3
2,01%  a.out         [.] _evas_event_object_list_raw_in_get.part.4
1,97%  a.out         [.] eina_chained_mempool_free
1,93%  a.out         [.] _edje_part_recalc_single
1,91%  a.out         [.] _edje_part_recalc
1,77%  a.out              [.] __strlen_avx2
1,51%  a.out                 [.] pipe_region_intersects
1,49%  a.out              [.] __GI___printf_fp_l
1,44%  a.out              [.] _int_free
1,34%  a.out        [.] __pthread_mutex_unlock
1,20%  a.out         [.] eina_strbuf_common_append

EFL master:

6,58%  a.out          [.] _efl_object_call_resolve
5,71%  a.out          [.] _eo_obj_pointer_get
4,87%  a.out        [.] _evas_event_object_list_raw_in_get.part.10
3,41%  a.out                [.] _dl_update_slotinfo
2,89%  a.out        [.] _evas_event_object_list_raw_in_get_single.constprop.37
2,36%  a.out              [.] _int_malloc
1,68%  a.out              [.] _int_free
1,61%  a.out        [.] __pthread_getspecific
1,60%  a.out                 [.] _evas_gl_common_context_push
1,55%  a.out          [.] efl_isa
1,39%  a.out          [.] _efl_object_call_end
1,33%  a.out        [.] _edje_part_recalc

It seems that in EFL master we are spending most of our time doing Eo stuff.
I was hoping at least the software renderer would spend more time drawing than
resolving function calls.

3,4) Autobounce output + system resources

This is definitely the most interesting test, and I'm not sure I like what
the data tell me.

| Renderer  | EFL    | Time Spent (ns) | Frames | Time (ns) per frame | CPU Usage (%) | Total Time (s) |
| Software  | 1.7    | 4_172_005_256   | 279    | 14_953_423          | 24            | 4.53           |
| Software  | master | 4_557_504_828   |  93    | 49_005_428          | 34            | 7.42           |
| OpenGL    | 1.7    | 2_723_094_193   | 279    |  9_760_194          | 16            | 3.55           |
| OpenGL    | master | 4_568_038_976   | 126    | 36_254_277          | 34            | 7.68           |
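For clarity, the "Time (ns) per frame" column above is simply Time Spent divided by Frames; a quick standalone check of the table's arithmetic:

```c
#include <assert.h>
#include <stdint.h>

/* Derive the "Time (ns) per frame" column from the first two data columns
 * of the table above (integer division, as the table rounds down). */
static int64_t ns_per_frame(int64_t total_ns, int64_t frames)
{
   return total_ns / frames;
}
```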

I think the total time taken to perform the same operations between the two
versions of EFL confirms that we definitely lost performance at some point.
This difference is clearly seen when running under valgrind: EFL master is so
slow it cannot perform its scrolling animation, while EFL 1.7 is fluid, even
with the software renderer.

I am intrigued by the significant difference in terms of frames. EFL master
displays fewer than half the frames EFL 1.7 does, takes longer doing so, and
consumes more CPU time.

I know benchmarks should be considered with care, and I'm conscious there
might be flaws in my testing, but they confirm that EFL 1.7 had much better
perceived responsiveness for me.

I tried to investigate the problem by removing the pointer indirection that Eo
does. I got a noticeable performance improvement (not in perceived
responsiveness though), but there is still a big difference with EFL 1.7:

| Renderer  | EFL    | Time Spent (ns) | Frames | Time (ns) per frame | CPU Usage (%) | Total Time (s) |
| OpenGL    | master | 4_513_911_510   | 134    | 33_685_906          | 33            | 7.45           |
| Software  | master | 4_453_696_881   | 100    | 44_536_968          | 32            | 7.09           |

The perf traces below show where we are taking our time:

With OpenGL:

6,39%  a.out           [.] _evas_event_object_list_raw_in_get.part.10
6,15%  a.out             [.] _efl_object_call_resolve
4,01%  a.out                   [.] _dl_update_slotinfo
3,66%  a.out           [.] _evas_event_object_list_raw_in_get_single.constprop.37
2,48%  a.out                 [.] _int_malloc
1,93%  a.out                 [.] _int_free
1,73%  a.out                    [.] _evas_gl_common_context_push
1,54%  a.out           [.] evas_object_recalc_clippees
1,52%  a.out             [.] _efl_object_event_callback_del
1,46%  a.out             [.] _efl_object_call_end
1,42%  a.out                   [.] update_get_addr
1,36%  a.out           [.] _edje_part_recalc
1,34%  a.out           [.] eina_chained_mempool_free
1,28%  a.out           [.] _eina_chained_mempool_alloc_in
1,28%  a.out             [.] _efl_object_event_callback_call
1,25%  a.out           [.] _edje_part_recalc_single
1,21%  a.out                 [.] cfree@GLIBC_2.2.5
1,21%  a.out           [.] eina_hash_find_by_hash
1,19%  a.out                   [.] __tls_get_addr
1,15%  a.out             [.] efl_data_scope_get
1,13%  a.out           [.] _evas_canvas_efl_object_event_thaw
1,06%  a.out                 [.] malloc_consolidate
1,02%  a.out             [.] efl_isa
0,95%  a.out           [.] eina_hash_free
0,90%  a.out           [.] eina_cow_write

With Software:

7,26%  a.out           [.] _evas_event_object_list_raw_in_get.part.10
5,45%  a.out             [.] _efl_object_call_resolve
3,97%  a.out           [.] _evas_event_object_list_raw_in_get_single.constprop.37
3,63%  a.out                   [.] _dl_update_slotinfo
2,45%  a.out                 [.] _int_malloc
1,75%  a.out                 [.] _int_free
1,73%  Eevas-thread-wk  [kernel]                   [.] 0xffffffffaed0f2d7
1,68%  Eevas-thread-wk         [.] _op_blend_p_dp_mmx
1,67%  Eevas-thread-wk  [unknown]                  [.] 0xffffffffaed0f2d7
1,43%  a.out           [.] evas_object_recalc_clippees
1,31%  a.out           [.] _edje_part_recalc
1,28%  a.out             [.] _efl_object_event_callback_del
1,25%  a.out             [.] _efl_object_call_end
1,24%  a.out                   [.] update_get_addr
1,20%  a.out           [.] eina_chained_mempool_free
1,18%  a.out           [.] _edje_part_recalc_single
1,17%  a.out                   [.] __tls_get_addr
1,13%  a.out           [.] _eina_chained_mempool_alloc_in
1,10%  a.out           [.] _evas_canvas_efl_object_event_thaw
1,08%  a.out             [.] _efl_object_event_callback_call
1,06%  a.out           [.] eina_hash_find_by_hash
1,02%  a.out             [.] efl_data_scope_get
0,96%  Eevas-thread-wk         [.] evas_common_scale_rgba_sample_draw
0,95%  a.out                 [.] cfree@GLIBC_2.2.5
0,93%  a.out           [.] eina_hash_free
0,85%  a.out                   [.] __tls_get_addr_slow
0,81%  a.out             [.] efl_isa
0,80%  a.out           [.] eina_cow_write
0,77%  Eevas-thread-wk         [.] _op_copy_c_dp_mmx

It seems I was running all the tests with my mouse cursor over the genlist,
causing events (mouse,in/mouse,out, I guess) to be raised. EFL master seems to
have a hard time with them. Results are slightly better without the mouse
over, but still far from what EFL 1.7 shows.

So my observation is that between EFL 1.7 and EFL master, we greatly lost in
perceived responsiveness, and are consuming significantly more CPU time.
But I don't really know _where_. Eo causes an obvious slowdown, but it does
not seem to me that it is solely responsible.

Thanks for taking the time to read this. Please tell me if my way of
collecting the data or my readings are incorrect; but I am convinced that what
is shown to the user appears less responsive with master than with 1.7.

I'd love to see EFL master having a perceived responsiveness close to what EFL 1.7 offered.
@raster, @cedric, what do you think of all that?

jayji created this task. Jan 6 2018, 9:35 AM

you're right that eo has added overhead. that was always an accepted consequence of it. about 5-10% is what we've seen. (eo calls use up 5-10% almost entirely in call resolve and in ptr get). the ptr indirection is hugely helpful in stability. i don't think we should drop it at all. we should get it to be faster though.

comparing 1.7 and "current efl" is going to compare MUCH MORE than just eo impact. there are a whole bunch of other changes too. so it's going to be complex to disentangle them. also keep in mind that valgrind isn't really a very good "performance measurement" system.

but you are right that performance is something we do need to spend time on, but right now it's not the highest priority. :( we have spent some time on it but here are some things i know from spending time on it:

  1. cost of a func call through eo api is much higher than a regular old style c func call. we have to either do table locks and unlocks or tls lookups (tls is faster than the locks/unlocks by a good margin. in total actually it'd be about 3x faster since you'd need 2 lock+unlock cycles vs 1 tls lookup). then there's the table indirection to look up the real pointer (and safety check) and the call resolve lookup too. they all cost. so before we had things like evas_object_move(o); evas_object_resize(o)... but today we could improve the amount of in/out through the api with an efl_gx_geometry_set(o) in 1 call.
  2. legacy apis WRAP eo api's. so you first call the legacy function call that then calls the eo api call. this of course adds overhead. this overhead can never be removed except by moving away from legacy api calls.
  3. i think our lookups in no 1 put a lot more strain on the cpu caches. i suspect we need to optimize how we look things up here to be more cache friendly. reducing the size of the data to look up is one thing we can do. ensuring data lines up to l1 cache lines (64 bytes generally on most cpu's regardless of architecture) is another. i don't think we've done much here at all in terms of optimizing this, especially in trying to keep data locality better.
  4. to have a proper comparison we need a test that has the same app written ONLY in legacy and one ONLY in eo api (see no 2 above). and it needs to probably be low level (like just efl_gfx or evas api) to minimize the amount of "internal efl code that still is jumping through legacy". also all the code paths inside efl that this would touch "during benchmarking" should be eo-ified to remove the legacy api jumping around.
  5. in addition to no 4, if we compiled all of efl (evas, elm, edje, ecore etc.) into a single .so/.dll technically link-time optimization COULD in theory remove the legacy api jumps ... i don't know if it does in real life (because legacy api are real symbols/functions etc.) ... but it might be worth investigating.
  6. there will absolutely be other causes of slow-down unrelated to eo. but identifying them over 1.7 -> current is nigh impossible. the best thing to do is to look at what we have now and find ways of optimizing it. at least having "this is what performance was like in 1.7" as a goal/target would be good, as you have done.
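to make the first two points concrete, here's a toy model where every call pays a fixed "resolve" cost, so batching move+resize into one geometry_set halves the number of resolves (names here are illustrative, not real EFL API):

```c
#include <assert.h>

/* Toy model: every "Eo-style" call pays one simulated resolve
 * (id lookup + method resolution). Batching two geometry calls into
 * one halves that cost. Obj, obj_move etc. are made-up names. */
static int resolves; /* counts simulated call resolutions */

typedef struct { int x, y, w, h; } Obj;

static void obj_move(Obj *o, int x, int y)   { resolves++; o->x = x; o->y = y; }
static void obj_resize(Obj *o, int w, int h) { resolves++; o->w = w; o->h = h; }
static void obj_geometry_set(Obj *o, int x, int y, int w, int h)
{ resolves++; o->x = x; o->y = y; o->w = w; o->h = h; }
```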

any help with this would be really appreciated. i really think measuring cpu usage without valgrind in the way is the right way to do this. in the genlist bounce test i added some self-measurement that measured how much cpu time was consumed over a run. something like this would be imho the right way to do it.

run with a real normal cpu operating normally, because caching really matters. perhaps run it with realtime priority and without an actual display (xserver): render to memory (buffer engine) and handle events virtually, set the cpu to a fixed clockrate (no freq scaling etc.) and then measure. things like eo and calling interfaces are going to really be affected here.

also using perf instead of callgrind is probably better as it won't hurt caching effects as much. when i was last optimizing eo i found that adding in goto's actually helps improve performance (there are mails and commit logs with the details): literally, instruction cache inefficiency hurt, and forcing code to re-order through gotos had measurable perf differences. if things are this "delicate" in performance, i imagine callgrind is not going to be doing us any favors. :(

oh wait.. you got those numbers from perf?

jayji added a comment. Jan 8 2018, 1:48 AM

> the ptr indirection is hugely helpful in stability. i don't think we should drop it at all. we should get it to be faster though.

I was thinking maybe we could make it optional. We could:

  • propose disabling it when configuring the EFL (via a configure/meson option),
  • create a libeo-noid (or with a fancier name) that users could link against when they are done developing and want to release.

There is of course an increased maintenance cost, but it is such a hot path that I think it may be worth it.

> oh wait.. you got those numbers from perf?

Yes with perf. Valgrind was used to do memory snapshots only.

> but right now it's not the highest priority. :( [...] any help with this would be really appreciated.

To me, that's important :). I'll try my best to see what can be improved.

raster added a comment. Jan 8 2018, 8:53 PM

> I was thinking maybe we could make it optional. We could:

We basically can't anymore because:

we need 1 bit of this id for the "ref bit". we need another bit for "super call". we need another 2 bits for "domain id" to separate main loop vs other thread objects. if we don't separate them we have to do table locking and unlocking on entry and exit to every call (a lock and unlock on entry, a lock and unlock on exit).

pointers technically can't store any extra metadata. we COULD force the object allocator to always allocate on 16 byte boundaries and thus mask off the lower 4 bits for this, but that's a bit ugly.

also you then compile efl in 2 modes - safe and unsafe. of course people will go "ooooh give me speeeeeed!" and suddenly apps that used to be stable no longer are. it's giving people a gun and some bullets and saying "if you shoot yourself in the feet, you will run much faster!"... and assuming they will make the sensible choice... :) history tells me that if people are given a choice to shoot themselves in the feet, they will, regardless of the warnings.

and often that choice is made by someone else (e.g. a package maintainer) and then the users end up complaining to us of mysterious crashes that us developers don't see. imagine all those code paths you didn't manage to test in every situation? that's where the crashes will happen. :(
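to illustrate the "16 byte alignment, mask off the lower 4 bits" idea (purely illustrative - eo does not do this, for exactly the safety reasons above):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* If objects are 16-byte aligned, the low 4 bits of their address are
 * always zero and can carry metadata (ref bit, super bit, 2-bit domain).
 * Sketch only: this trades away the id-indirection safety checks. */
#define META_MASK ((uintptr_t)0xF)

static uintptr_t pack(void *ptr, unsigned meta)
{
   assert(((uintptr_t)ptr & META_MASK) == 0); /* requires 16-byte alignment */
   return (uintptr_t)ptr | (meta & META_MASK);
}

static void *unpack_ptr(uintptr_t id)    { return (void *)(id & ~META_MASK); }
static unsigned unpack_meta(uintptr_t id) { return (unsigned)(id & META_MASK); }
```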

> To me, that's important :). I'll try my best to see what can be improved.

indeed that's good. trust me, i care about speed too, but it's just pushed down by more important things above it. i hope some of what i said above can help you. i really think that giving up eoid safety is far too much of a "foot shooting fest". i think it'd be better to find ways to speed it up.

conceptually it's very simple. take the ptr (that's a 32 or 64bit number) and split out bit ranges as fields (integers), then look up the right table # and row number - in that row will be a ptr to the real obj, along with a copy of the same generation count that's in the eoid (this generation count is important as it's a sanity check for id re-use ... you'd have to alloc the obj with the same generation count and re-use the same table + row number to get a false positive here).

i suspect that walking the memory for the table id (split into 2 levels - table and mid table, so it's a 2 level n-ary tree), then reading the row, is causing the slowdown here. it's possibly accessing uncached data and we're stalling going all the way to memory maybe 3 times (mid table, table and row). depending on your memory/cpu etc. that could be 300ns right there just stalling on memory before getting the real pointer. so having this data take fewer hops (maybe only 2 levels) would cut that down to 200ns. finding some way to have the data be more cache friendly might drop this even more.

perhaps a per-thread tls "lookup cache" of a few entries? like 8 or 16 of them (1 cacheline? or a multiple of cachelines) which just have a direct eoid->ptr table/array for super fast lookup of the most recent 8 or 16 objects? not sure if the cost of maintaining this will be more than any gains. measurement will tell.

but my point here is... i think focusing on ways of speeding it up is the way to go rather than "removing it". the reality is that speeding things like this up is all about micro-optimizations and thinking carefully about memory, cache design etc. ...
actually recently there was an article about using the xmm registers (sse4) as a 160 byte in-cpu cache/buffer.

one thing that would be awesome is that some SoC's have sram. sram is basically l1 cache but without the caching logic - it's manually handled. often there's something like 64-256kb of this on-cpu memory. putting the eoid tables in there would probably be a huge win. but this really depends on a hardware feature and that feature being exposed to userspace. currently kernels don't do that (it's something i've discussed with kernel devs, pointing out that mostly this sram goes unused and exposing it to userspace could have amazing benefits).

the call resolve too is the other main vector of pain. i did put in a cache here (i tried various cache sizes but 1 slot seemed to be the best). you'll find it in the eo code with a #define for slot size. this gave a bit of a win. perhaps again, finding ways of optimizing data storage to reduce stalls when a lookup happens might be nice. like the above...
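a toy model of such a single-slot resolve cache (illustrative names and costs, not eo's actual code): remember the last (class, op) -> function mapping so a run of calls on the same class skips the expensive walk.

```c
#include <assert.h>

/* Counts how many full (slow) resolutions were actually performed. */
static int slow_lookups;

/* Stand-in for walking the class's method table. */
static int slow_resolve(int klass, int op)
{
   slow_lookups++;
   return klass * 100 + op;
}

typedef struct { int klass, op, fn, valid; } Slot;
static Slot cache; /* the single cache slot */

static int resolve(int klass, int op)
{
   if (cache.valid && cache.klass == klass && cache.op == op)
     return cache.fn;                       /* hit: skip the walk */
   Slot s = { klass, op, slow_resolve(klass, op), 1 };
   cache = s;                               /* miss: refill the slot */
   return cache.fn;
}
```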

hmm i forgot i did this. i already put in a single entry eoid lookup cache in the eo data table... so i already have some amount of caching there. i did a quick dump of eoid lookups during something like the genlist scrolling. here's a sample of the eoid's looked up:


so the single entry eoid lookup cache should already be getting good hit rates. the amount of code prior to a hit is pretty simple:

data = _eo_table_data_get();
domain = (obj_id >> SHIFT_DOMAIN) & MASK_DOMAIN;
tdata = _eo_table_data_table_get(data, domain);
if (EINA_UNLIKELY(!tdata)) goto err;

if (obj_id == tdata->cache.id)
  return tdata->cache.object;

_eo_table_data_get() is:

static inline Eo_Id_Data *
_eo_table_data_get(void)
{
   Eo_Id_Data *data = eina_tls_get(_eo_table_data);
   if (EINA_LIKELY(data != NULL)) return data;

   data = _eo_table_data_new(EFL_ID_DOMAIN_THREAD);
   if (!data) return NULL;

   eina_tls_set(_eo_table_data, data);
   return data;
}

and _eo_table_data_table_get() is:

static inline Eo_Id_Table_Data *
_eo_table_data_table_get(Eo_Id_Data *data, Efl_Id_Domain domain)
{
   return data->tables[domain];
}

so realistically the tls get will always return non-null except for first use. so if we flatten it out we get:

data = eina_tls_get(_eo_table_data);
domain = (obj_id >> SHIFT_DOMAIN) & MASK_DOMAIN;
tdata = data->tables[domain];
if (EINA_UNLIKELY(!tdata)) goto err;
if (obj_id == tdata->cache.id) return tdata->cache.object;

so a tls fetch, a prefetch request to try pre-seed cache in advance, a bitshift + and, a fetch of a ptr from a very small array (tables is just 4 ptrs), another deref of the ptr and prefetch, check if tdata is null (actually these are in the wrong order ... in case tdata is null - it probably almost never should be), then check domain is not shared to avoid the lock/unlock, then fetch cache id and compare to obj and if match - return.

can we speed this up? hmm. remove the prefetches? they may or may not help. move the cache into data instead of tdata? hmmm. that would not be good because data is the thread's state (what tables it can see) and the tables are the actual eoid tables so when objects are removed or added they are done with tdata... but maybe i can make this work. move it into data... shared domain has a specific domain id. the other domain id's (main and thread) are specific constants too (values are from 0 to 3).

here's perf without changes:

+    6.47%  elementary_test            [.] _efl_object_call_resolve
+    3.87%  elementary_test            [.] _eo_obj_pointer_get
+    3.40%  Eevas-thread-wk          [.] _op_blend_p_dp_mmx
+    2.74%  Evas-preload              [.] inflate
+    1.92%  elementary_test          [.] _evas_render_phase1_object_process
+    1.63%  elementary_test                [.] _dl_addr
+    1.56%  elementary_test          [.] _efl_canvas_group_efl_gfx_position_set
+    1.49%  Eevas-thread-wk          [.] _op_copy_c_dp_mmx
+    1.42%  elementary_test          [.] _edje_part_recalc
+    1.39%  Eevas-thread-wk  [unknown]                   [k] 0x00007ffb18fbfd00

here's perf with my changes to put cache in data (and removed the prefetch lines):

+    6.71%  elementary_test            [.] _efl_object_call_resolve
+    4.23%  Eevas-thread-wk          [.] _op_blend_p_dp_mmx
+    4.01%  elementary_test            [.] _eo_obj_pointer_get
+    2.04%  elementary_test                [.] _dl_addr
+    1.88%  Eevas-thread-wk          [.] _op_copy_c_dp_mmx
+    1.68%  Eevas-thread-wk  [unknown]                   [k] 0x00007f0690c2ed00
+    1.65%  elementary_test          [.] _evas_render_phase1_object_process
+    1.53%  elementary_test          [.] _efl_canvas_group_efl_gfx_position_set
+    1.44%  elementary_test          [.] _edje_part_recalc
+    1.36%  elementary_test                  [.] _dl_update_slotinfo
+    1.32%  elementary_test            [.] efl_isa
+    1.24%  Eevas-thread-wk  [kernel.vmlinux]            [k] clear_page_erms
+    1.22%  Eevas-thread-wk          [.] evas_common_font_glyph_draw
+    1.20%  elementary_test          [.] _edje_part_recalc_single
+    1.03%  elementary_test          [.] _evas_object_intercept_call_evas

close. perhaps a little worse?

      │             // Check the validity of the entry                                                    
      │             if (tdata->eo_ids_tables[mid_table_id])                                               
 1.91 │       mov    0x28(%rcx,%rdi,8),%rcx                                                               
 0.36 │       test   %rcx,%rcx                                                                            
      │     ↓ je     140                                                                                  
      │               {                                                                                   
      │                  _Eo_Ids_Table *tab = TABLE_FROM_IDS;                                             
 0.15 │       movswq %dx,%rdx                                                                             
 7.48 │       mov    (%rcx,%rdx,8),%rdx                                                                   
      │                  if (tab)                                                                         
 1.08 │       test   %rdx,%rdx                                                                            
 0.02 │     ↓ je     140                                                                                  
      │                    {                                                                              
      │                       entry = &(tab->entries[entry_id]);                                          
      │                       if (entry->active && (entry->generation == generation))                     
      │       movswq %ax,%rax                                                                             
      │       shl    $0x4,%rax                                                                            
 0.32 │       add    %rdx,%rax                                                                            
25.62 │       testb  $0x1,0x22(%rax)                                                                      
 1.42 │     ↓ je     140                                                                                  
 2.11 │       movzwl 0x22(%rax),%edx                                                                      
 0.09 │       mov    %ebx,%ecx                                                                            
 0.15 │       and    $0x3ff,%ecx                                                                          
 1.17 │       shr    %dx                                                                                  
 0.45 │       and    $0x3ff,%edx                                                                          
 0.94 │       cmp    %ecx,%edx                                                                            
      │     ↓ jne    140

a bit test shouldn't take THAT much time considering the rest (also that mov taking 7.48% is bad), but my guess is that the testb and the mov are taking that long due to stalling on memory - my cpu caching argument, i guess, as the cmp later that checks the generation runs fast. the annotation is similar before and after. the testb is still horrible.

so here is a thought. on 64bit our pointers are 2x as big. that's worse cache coherency. if we can compress ptrs down to 32bits ... life would be better, right? every table row is:

typedef struct
{
   /* Pointer to the object */
   _Eo_Object *ptr; // 32 or 64bits
   /* Indicates where to find the next entry to recycle */
   Table_Index next_in_fifo; // 16bits
   /* Active flag */
   unsigned int active     : 1; // 1 bit
   /* Generation */
   unsigned int generation : BITS_GENERATION_COUNTER; // 7 or 10 bits

} _Eo_Id_Entry;

so the reality is the entries would end up being padded out to 16 bytes per entry due to alignment. if we could get our obj ptr down to 32bits (on 64bit) we'd improve our caching... right? to do this i've kind of had an idea for a while... a custom allocator that returns 32bit ptrs. they are actually a 32bit OFFSET from a single memory region that has a known global base address, so

void *realptr = (char *)base32ptr + ((uintptr_t)ptr32 << 4); /* 16 byte aligned */

that base32ptr should stay cached as it'd be accessed very often... all we need is a malloc-like allocator that can allocate a given slice of memory within a "virtually large" block. for 32bit we can just pass through to malloc and not do any of the bitshifts or adds. on 64bit - the above. throw in a bitshift and an add and hope that uses fewer cycles than the ones we lose to bad cache hits? we'd need a decent allocator algorithm though. but ... this COULD help. in theory. a test might be to make a quick and dirty allocator that does this and see. a bit of work... but an idea. maybe it won't work and we need something else?
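a quick-and-dirty sketch of what such an allocator could look like - just a bump allocator over one region to show the 32bit-offset encode/decode (names invented here; a real one would need free-list handling, growth, etc.):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* All objects live in one region with a global base; a "pointer" is a
 * 32-bit offset in 16-byte units, decoded as base + (off << 4). */
static char   *base32ptr;
static size_t  region_used, region_size;

static int region_init(size_t size)
{
   base32ptr = aligned_alloc(16, size);
   region_size = size;
   region_used = 0;
   return base32ptr != NULL;
}

static uint32_t alloc32(size_t size)
{
   size = (size + 15) & ~(size_t)15;        /* keep 16-byte alignment */
   assert(region_used + size <= region_size);
   uint32_t off = (uint32_t)(region_used >> 4);
   region_used += size;
   return off;
}

static void *ptr32(uint32_t off)
{
   return base32ptr + ((uintptr_t)off << 4);
}
```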

jayji added a comment. Jan 13 2018, 2:09 PM

I've made some progress on _eo_obj_pointer_get. I'm not really sure why there were two tables to reach the final pointer, but I now use only one table, so I end up with one less indirection. I managed to grab some more precious frames by using aggressive inlining of code per domain and a jump table with pointers to labels (I've always wanted to use that!!). And finally, I used AVX2 vectorization to retrieve in one go the information from the Eo_Id.

I feel like I've exhausted my optimization power on this function. "Compressing" the pointer size may be a great idea. I haven't thought too much about how to implement it. It seems quite difficult to guarantee that the offset will always fit for a given run. I mean: if we do something like

void *const base_1 = malloc((1 << 12) * 2); /* 2 Pages */
void *const base_2 = mmap(NULL /* Can we provide a hint?! Not sure how. */, ...);

and if we have too many objects allocated, handling another offset seems tedious. Does this mean we should pre-allocate room for ALL the objects?
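(For what it's worth, mmap's first argument is exactly such a hint: without MAP_FIXED the kernel treats it as a preference and may place the mapping elsewhere, so any offset-from-base invariant still has to be checked against the returned address. A minimal sketch, map_near being a made-up helper:)

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* mmap(addr, ...) without MAP_FIXED: addr is only a hint. Code relying
 * on a fixed base must verify the returned address (or fall back)
 * rather than assume the hint was honored. */
static void *map_near(void *hint, size_t len)
{
   void *p = mmap(hint, len, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
   return (p == MAP_FAILED) ? NULL : p;
}
```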

Well, now the _eo_obj_pointer_get function looks like this:

_Eo_Object *
_eo_obj_pointer_get(const Eo_Id obj_id, const char *restrict func_name, const char *restrict file, int line)
{
   static const void *const jump[] = {
      /* ... one &&label per domain (e.g. &&do_domain_main) ... */
   };

#ifdef __AVX__

   const __m256i src = _mm256_set_epi64x(obj_id, obj_id, obj_id, obj_id);
   const __m256i shift = _mm256_set_epi64x(SHIFT_DOMAIN, 0, 0, SHIFT_ENTRY_ID);
   const __m256i masks = _mm256_set_epi64x(MASK_DOMAIN, MASK_GENERATIONS, MASK_OBJ_TAG, MASK_ENTRY_ID);

   const __m256i shifted = _mm256_srav_epi32(src, shift);
   const __m256i result = _mm256_and_si256(shifted, masks);

   const int64_t domain = _mm256_extract_epi64(result, 3);
   const int64_t generation = _mm256_extract_epi64(result, 2);
   const int64_t tag_bit = _mm256_extract_epi64(result, 1);
   const int64_t entry_id = _mm256_extract_epi64(result, 0);

#else

   const unsigned int domain = (obj_id >> SHIFT_DOMAIN) & MASK_DOMAIN;
   const size_t entry_id = (obj_id >> SHIFT_ENTRY_ID) & MASK_ENTRY_ID;
   const unsigned int generation = obj_id & MASK_GENERATIONS;
   const Eo_Id tag_bit = (obj_id) & MASK_OBJ_TAG;

#endif

   goto *jump[domain];

do_domain_main: EINA_HOT /* That's a new attribute for labels */
        if (obj_id == _eo_main_id_table.cache.id)
          return _eo_main_id_table.cache.object;

        if (EINA_UNLIKELY(!tag_bit ||
                 (entry_id >= _eo_main_id_table.count)))
          goto main_err;

        register const Eo_Id_Entry *const entry = &(_eo_main_id_table.entries[entry_id]);

        if (EINA_LIKELY(entry->data.generation == generation))
             // Cache the result of that lookup
             _eo_main_id_table.cache.object = entry->data.ptr;
    = obj_id;
             return _eo_main_id_table.cache.object;

        goto main_err;

main_err: EINA_COLD /* That's a new attribute for labels */
   if (obj_id)
     _eo_obj_pointer_invalid(obj_id, &_eo_main_id_data, domain, func_name, file, line);
   return NULL;

/* ... next domains ... */

And the entries in the table are:

typedef union
{
   /* Actual data. With alignment: 128 bits */
   struct {
      _Eo_Object  *ptr;
      unsigned int generation;
   } data;

   /* To handle free elements in the table */
   struct {
      uintptr_t null;
      size_t    next;
   } meta;
} Eo_Id_Entry;
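A minimal sketch of how the meta struct can drive an intrusive free list over such a table, so freed slots get reused without any extra allocation. The table size and function names here are illustrative, not the actual EFL code:

```c
#include <stddef.h>
#include <stdint.h>

typedef struct _Eo_Object _Eo_Object; /* opaque, only used through pointers */

typedef union
{
   struct {
      _Eo_Object  *ptr;
      unsigned int generation;
   } data;
   struct {
      uintptr_t null; /* 0 marks the slot as free (a live ptr is never NULL) */
      size_t    next; /* index of the next free slot, or SIZE_MAX */
   } meta;
} Eo_Id_Entry;

#define TABLE_COUNT 8 /* illustrative size */

static Eo_Id_Entry _entries[TABLE_COUNT];
static size_t _free_head = SIZE_MAX; /* top of the intrusive free list */
static size_t _high_water = 0;       /* first never-used slot */

/* Pop a previously freed slot if any, otherwise take a fresh one */
static size_t
entry_alloc(_Eo_Object *ptr, unsigned int generation)
{
   size_t idx;
   if (_free_head != SIZE_MAX)
     {
        idx = _free_head;
        _free_head = _entries[idx].meta.next;
     }
   else
     {
        if (_high_water >= TABLE_COUNT) return SIZE_MAX; /* table full */
        idx = _high_water++;
     }
   _entries[idx].data.ptr = ptr;
   _entries[idx].data.generation = generation;
   return idx;
}

/* Push the slot onto the free list; 'null' doubles as the free marker */
static void
entry_free(size_t idx)
{
   _entries[idx].meta.null = 0;
   _entries[idx].meta.next = _free_head;
   _free_head = idx;
}
```

Freeing then reallocating hands back the same slot, which keeps live entries packed at low indices.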

Pointer compression would make the entries fit in 64 bits only. If we restrict even more, we may go down to 32 bits. Aside from the pointer compression you suggested, I'm having a hard time figuring out what more can be done in this function. At least, with my benchmark, my changes allow 10 more frames to be displayed, lowering the time spent on one frame to 33_569_797 ns (versus 36_254_277 ns before). Yay!

I'll soon propose my changes on Diffusion, once I've cleaned up my mess. Meanwhile, I'll give more thought to the pointer compression.

having fun? it actually sounds like it. sometimes. often, optimizing can be fun. and you learn things. :) do you have any ... results? like it made a difference?

I'm not really sure why there were two tables to reach the final pointer, but I now use only one table

i think it was to use less memory so we only need to keep a list of ptrs to sub-tables in the mid table and if a sub table ptr set is empty we can remove all of it.

"Compressing" the pointer size may be a great idea. I haven't thought too much about how to implement it. It seems quite difficult to guarantee that the offset will always hold for a given run...

actually you can't just compress any pointer from malloc and friends. it could be anywhere in memory. my thought was literally to implement a custom malloc implementation that GUARANTEES that everything is allocated from a specific region that will always be "within a 32bit value" - well, 32bit multiplied by the alignment we use (so if 8 bytes, the implementation can allocate up to a theoretical 32gb of memory, which should be plenty for eo objects... if 16 bytes, then 64gb of memory). we could even have multiple such regions, allocating objects from one pool and other things from another pool (different base ptrs). the problem here is:

  1. converting all the pointer accesses to a special static inline or macro that will expand the "reference" to a real ptr as well as all the places where they are allocated and freed being changed.
  2. writing a semi-efficient malloc implementation that uses a SINGLE arena (region of memory) and that is not going to fragment too badly AND can give memory back to the system when large enough regions become empty. allocate the arena memory with an anonymous mmap(), relying on the os to actually only make memory we TOUCH real. allocate all 32 or 64gb this way. giving back can be done via madvise(). trying to force crashes on as much unallocated memory as possible could be done with mprotect() (make regions we have no allocations in non-writable maybe even non-readable too). the allocator maybe could be a hybrid of buddy + freelist at the start and can be improved over time. i once looked at jemalloc to see if it could be adapted but it was going to be a lot of work. so something simpler might do. i was mulling writing at least a test with a pretty dumb allocator to see if it made a difference (then have the same allocator either just return full 64bit ptrs or the compressed style to see the difference in performance so we use the same allocator in both cases just testing to see if the ptr size makes a difference). this is a THEORY that it might help. i don't know for sure. it's a guess based on the profiling i've done.
void *const base_1 = malloc((1 << 12) * 2); /* 2 Pages */
void *const base_2 = mmap(NULL /* Can we provide a hint?! Not sure how. */, ...);

yeah. not really doable. see above. i think allocating one large region then having an allocator divide it up (the same thing malloc does to the heap that gets sbrk()'d up and down, but always allocating from the single heap and never dividing out to separate heaps like libc malloc will do with mmap if the allocation is, i think, more than 256k).

And finally, I used the AVX2 vectorization

oooh this is a double edged sword. when avx2 is used, the cpu actually clocks down more. it can't turbo boost on intel and i think it even clocks lower than normal, also forcing other cores to clock down too... this has to be benchmarked to see if these clocking effects will happen for minimal instruction usage like above or not. have you compared the speed between avx2 and the plain c?


is _ a ","? like it's 33569797 ns? did you check with perf? like has the % gone down? i have found the ns displayed from the genlist bounce test to be a bit variable though. i have forced my cpu to max clock/max performance etc. .. :) btw - don't pay attention to the number of frames unless you are not hitting 60fps (or whatever the framerate is meant to be) and are thus lagging due to cpu being 100% ... pay attention to the cpu time per frame. that's the realistic cost, as our real costs are per frame (unless we're cpu bound and not even managing the target framerate). how many bounces do you use? i use 50:

ELM_TEST_AUTOBOUNCE=50 elementary_test -to genlist

over 5 runs i get:

NS since 2 = 2650501242 , 1470 frames = 1803062 / frame
NS since 2 = 2211807153 , 1422 frames = 1555419 / frame
NS since 2 = 2649508337 , 1470 frames = 1802386 / frame
NS since 2 = 2307449171 , 1421 frames = 1623820 / frame
NS since 2 = 2623945140 , 1470 frames = 1784996 / frame

the test should be good in that it only starts benchmarking from bounce 2 so the window should already be up and visible and "stable". so be careful of your benchmarking.

i gave 32bit ptrs a shot. i implemented a "not too bad" 32bit ptr system. at least benchmark-wise it seems comparable to glibc with 2 threads (it's worse with more and better with 1 - i just used a single simple spinlock for the arena/pool). i can't measure a difference. cpu time needed is like within 0.05% of each other and perf doesn't seem to show any appreciable percentage differences. :( well there goes that theory.

at least i now have a not-too-shabby memory allocator of my own. i could improve it by using free lists instead of alloc lists, and maybe by binning the free list segments, but in this case you can't even see these mem allocator funcs in the first few screens of perf results.

here it is with 32bit ptrs:

6.08%  elementary_test                [.] _efl_object_call_resolve
4.33%  Eevas-thread-wk              [.] _op_blend_p_dp_mmx
3.93%  elementary_test                [.] _eo_obj_pointer_get
2.06%  elementary_test                    [.] _dl_addr
2.05%  Eevas-thread-wk              [.] _op_copy_c_dp_mmx
1.96%  Eevas-thread-wk  [kernel.vmlinux]                [k] clear_page_erms
1.76%  Eevas-thread-wk  [unknown]                       [k] 0x00007f852aac9d00
1.59%  elementary_test              [.] _evas_render_phase1_object_process
1.54%  elementary_test              [.] _edje_part_recalc
1.35%  elementary_test                      [.] _dl_update_slotinfo
1.32%  Eevas-thread-wk              [.] evas_common_font_glyph_draw
1.24%  elementary_test              [.] _efl_canvas_layout_efl_gfx_position_set
1.14%  elementary_test                [.] efl_isa
1.04%  elementary_test              [.] pthread_spin_lock

And with regular 64bit pointers:

6.21%  elementary_test            [.] _efl_object_call_resolve
4.63%  elementary_test            [.] _eo_obj_pointer_get
3.55%  Eevas-thread-wk          [.] _op_blend_p_dp_mmx
2.66%  Evas-preload              [.] inflate
1.77%  elementary_test                [.] _dl_addr
1.73%  elementary_test          [.] _evas_render_phase1_object_process
1.56%  Eevas-thread-wk          [.] _op_copy_c_dp_mmx
1.54%  Eevas-thread-wk  [kernel.vmlinux]            [k] clear_page_erms
1.45%  Eevas-thread-wk  [unknown]                   [k] 0x00007f5740b7fd00
1.37%  elementary_test          [.] _edje_part_recalc
1.32%  elementary_test                  [.] _dl_update_slotinfo
1.18%  elementary_test            [.] efl_isa
1.07%  elementary_test          [.] _efl_canvas_group_efl_gfx_position_set
0.97%  elementary_test          [.] _evas_object_intercept_call_evas
  • I didn't think about vectorization taking more CPU. I'll have a look. But I've seen less time spent per frame.
  • My _ are just visual separators, they have no meaning for the measure. You can see them as non-breaking spaces.

I feel like the time spent per frame is easier to interpret than the % given by perf, and since it is the end result I mostly concentrate on decreasing this number.
I am using the same autobounce code from the elm_test. At first, I didn't take elm_test itself because I wanted to use the same code to see the difference between EFL 1.7 and EFL master. For D5738 (and my current benchmarks), I'm using elm_test with 100 bounces (over 3 executions).

Your allocator seems really nice! If the action to allocate something is just pushing an offset, it may be easy to use a lock-free/wait-free sentinel and get rid of the spinlock.
I've submitted D5738, which brings some bits of perf here and there (with extra commits I forgot to prune before doing the arc diff). I'll wait for your feedback to submit work that is built on top of that.

Your allocator seems really nice! If the action to allocate something is just pushing an offset

Well in real life it looks like the memory footprint goes up - i checked. it seems to fragment more than glibc. so there is a cost. speed-wise with a whole bunch of allocs and frees it does ok, but spatially... not so much. also to alloc and free requires modifying a linked list of allocations, thus it needs the lock no matter what. this could definitely be improved, but my goal was to have a good enough implementation to test the theory of "make ptrs smaller == speed up eo". at least i know it'd cut memory usage IF it didn't fragment as badly.

it may be easy to use a lock-free/wait-free sentinel and get rid of the spinlock.

This really didn't even appear to be an issue from my perf analysis, but given the above, i can't nuke the spinlock. my understanding of the trick things like glibc use to improve this is per-thread "TLS" arenas - small ones to allocate in - so these per-thread lists stay small, and i assume they regularly get merged back into a global list. only these merges require locking. the idea is that common patterns like short-term malloc+free for temporary variables in a function (or just in a child func passing back to a parent) then don't pollute the global memory pool, but for eo this wouldn't be an issue. :)

at least i could remove the lock if i made an unsafe version only for thread-local objects, which atm are almost ALL objects in eo... that is the far more obvious path for removing the lock as we know we don't need it. it'd be 1 mem arena per thread for thread-local objects (without a lock) and 1 for shared (with a lock). but this poses a lot of problems. mmapping a large region may fail if the total amount of memory mapped into a process exceeds the total mem on a system... i noticed the largest mem region i could actually mmap was 1gb, even on 64bit, and my system has 32gb of real ram (no swap though). my kernel has defaults with overcommit_ratio set to 50. either way this is something that we can't modify, so it's a limitation, and thus even though only pages we touch really count, there is a limit to the total size we can pre-allocate.

i would need to implement a mremap() setup where i expand the region, but then eventually a remap will fail as it encounters another mapping afterwards and can't expand without relocating, and relocating can't be done as the entire mem block right now is a single one including the header for memory accounting. everything lives in that single mapping. it makes calculating the real ptr very fast (it's arena address + (ptr << 4)). i can also even take an arena address and figure out the arena header from it. anyway...
it's a first try at this but not that nice. it needs much work. it might be worth that work IF this led to an appreciable speedup. it doesn't. :(
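For reference, the expansion arithmetic described above - real ptr = arena address + (ptr << 4), and recovering the arena header from any interior pointer by masking - can be written as below. This assumes the mapping is aligned to its own power-of-two size, which is what makes the header lookup a single mask; names are illustrative:

```c
#include <stdint.h>

#define ARENA_SIZE ((uintptr_t)1 << 20) /* power of two; the mapping is aligned to it */

typedef struct { uintptr_t base; /* accounting lives here */ } Arena_Header;

/* ref -> real pointer: arena address + (ref << 4), i.e. 16-byte granularity */
static inline void *
ref_expand(const Arena_Header *arena, uint32_t ref)
{
   return (void *)((uintptr_t)arena + ((uintptr_t)ref << 4));
}

/* any pointer inside the mapping -> its arena header, by masking the low bits */
static inline Arena_Header *
arena_from_ptr(void *ptr)
{
   return (Arena_Header *)((uintptr_t)ptr & ~(ARENA_SIZE - 1));
}
```

The masking trick is why the whole block has to stay one contiguous, size-aligned mapping - which is also why relocating it on mremap() failure is so awkward.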

I've submitted D5738

I'll take a look

bu5hm4n triaged this task as High priority.Jun 10 2018, 12:27 PM
zmike edited projects, added Restricted Project; removed efl.Jun 11 2018, 6:53 AM
bu5hm4n edited projects, added efl: data types; removed Restricted Project.Jun 11 2018, 7:45 AM
zmike edited projects, added Restricted Project; removed efl (efl-1.21).Jun 15 2018, 7:34 AM
zmike added a subscriber: ManMower.
ManMower added a project: Restricted Project.

Guess I'll dig into this a bit.

btw, looks like about 2/3 of those pointer lookups hit the fast path here and execute very quickly.

Looks like maybe some low hanging fruit in all that list activity.

With @bu5hm4n's help I've managed to disable the focus manager. That code has performance issues, but I think they're mostly understood at this point and @bu5hm4n has some good ideas to mitigate them.

Without focus manager doing linked list junk to pollute perf results we're back to some very large numbers for eo

7.38%  junk                [.] _eo_obj_pointer_get
5.86%  junk                [.] _efl_object_call_resolve
3.21%  junk                [.] _vtable_func_get
3.05%  junk                [.] _event_callback_call
2.81%  junk                [.] _efl_object_event_callback_del
2.47%  junk                      [.] pipe_region_intersects
1.99%  junk                [.] _efl_data_scope_get
1.85%  junk              [.] evas_object_recalc_clippees
1.56%  junk                   [.] _int_malloc

On a software rendered run on a 1920x1080 display those top 4 functions are still ahead of evas_common_scale_rgba_sample_draw.

Still not yet sure if genlist is doing infinitely more eo object interactions than it should, or if overhead per interaction is unreasonably high.

raster added a comment.EditedJul 24 2018, 1:05 AM

you know... it really depends on your hardware. for me _efl_object_call_resolve and _eo_obj_pointer_get are reversed on my desktop.

7.76%  elementary_test           [.] _efl_object_call_resolve
5.16%  elementary_test           [.] _eo_obj_pointer_get
4.43%  Eevas-thread-wk         [.] _op_copy_c_dp_mmx
1.66%  elementary_test           [.] efl_isa
1.50%  Eevas-thread-wk  [kernel.vmlinux]          [k] clear_page_erms
1.09%  elementary_test           [.] _efl_object_event_callback_legacy_call
1.04%  Eevas-thread-wk  [kernel.vmlinux]          [k] shmem_getpage_gfp.isra.6
1.02%  elementary_test        [.] __pthread_getspecific
0.99%  elementary_test         [.] _edje_part_recalc
0.98%  elementary_test         [.] evas_common_tilebuf_add_redraw
0.97%  Evas-preload            [.] inflate
0.95%  Eevas-thread-wk  [kernel.vmlinux]          [k] native_irq_return_iret
0.92%  elementary_test         [.] evas_render_updates_internal_loop
0.91%  elementary_test           [.] _efl_object_call_end
0.91%  Eevas-thread-wk         [.] evas_common_font_glyph_draw
0.91%  Eevas-thread-wk         [.] _op_blend_p_dp_mmx

and oddly evas's rendering for me comes up much higher. i also notice variances in runs by 1% or so for some of the top 2 or 3 entries. there is a cache in the eoid lookup already (single slot cache). i've done numbers and see about a 64% hit rate for the single slot cache. i've tried a 2 slot cache with a 72% hit rate and a 4 slot cache which gets an 88% hit rate, but it gets slower as the cost of the cache is now higher than what it saves, so a single slot seems to be about the best compromise. so with the hit/miss tracking and printing:

4 slot:

7.34%  elementary_test             [.] _efl_object_call_resolve
6.87%  elementary_test             [.] _eo_obj_pointer_get
3.97%  Eevas-thread-wk           [.] _op_copy_c_dp_mmx
3.91%  Evas-preload              [.] inflate
1.52%  elementary_test             [.] efl_isa
1.25%  Eevas-thread-wk  [kernel.vmlinux]            [k] clear_page_erms
1.05%  elementary_test             [.] efl_data_scope_get
0.97%  elementary_test          [.] __pthread_getspecific

2 slot:

6.95%  elementary_test                [.] _efl_object_call_resolve
5.73%  elementary_test                [.] _eo_obj_pointer_get
4.36%  Eevas-thread-wk              [.] _op_copy_c_dp_mmx
3.54%  Evas-preload                 [.] inflate
1.40%  elementary_test                [.] efl_isa
1.35%  Eevas-thread-wk  [kernel.vmlinux]               [k] clear_page_erms
1.02%  Eevas-thread-wk  [kernel.vmlinux]               [k] shmem_getpage_gfp.isra.6
0.96%  elementary_test                [.] _efl_object_event_callback_legacy_call

1 slot:

7.12%  elementary_test           [.] _efl_object_call_resolve
5.23%  elementary_test           [.] _eo_obj_pointer_get
4.05%  Evas-preload            [.] inflate
3.97%  Eevas-thread-wk         [.] _op_copy_c_dp_mmx
1.55%  elementary_test           [.] efl_isa
1.42%  Eevas-thread-wk  [kernel.vmlinux]          [k] clear_page_erms
0.95%  elementary_test           [.] _efl_object_event_callback_legacy_call
0.95%  Eevas-thread-wk  [kernel.vmlinux]          [k] shmem_getpage_gfp.isra.6

so a single slot seems the best. given that 64% hit rate that means 64% of the time it executes very little code. since all the objects we look up are not shared then the if (EINA_LIKELY(domain != EFL_ID_DOMAIN_SHARED)) branch will always be taken, so 64% of the time all it executes is:

data = _eo_table_data_get();
domain = (obj_id >> SHIFT_DOMAIN) & MASK_DOMAIN;
tdata = _eo_table_data_table_get(data, domain);
if (EINA_UNLIKELY(!tdata)) goto err;

     if (obj_id == tdata->cache.id)
       return tdata->cache.object;

my experience is ifs are really cheap. originally eo's tables were global with NO LOCKS. this was madness and we had random crashes sometimes... i added spinlocks, but this was costly due to atomics and stalls, so because of that i then changed the design to thread-local storage and eoid tables per thread. pthread local storage (__pthread_getspecific) was cheaper than a spinlock lock+unlock cycle, and actually we had to have 2 locks and unlocks to minimize contention. now microbenchmarks:

spin   0.65876 83231301 = 6.361 ns / lock+release
lock   1.46516 83231301 = 14.425 ns / lock+release
self   0.20352 83231301 = 1.809 ns / self_get
tls    0.15944 832312a9 = 1.594 ns / get

a spinlock lock+unlock on my system is 6.3ns, a tls lookup is 1.6ns ... this is one of the major reasons behind the tls design, not to mention also solving "thread safety". it avoids the locks and unlocks. since we needed 2 that was 12.6ns vs 1.6ns ... so getting rid of the tls lookup i don't think is viable as the only alternatives are either global single table with no locks (forget threads with eo - ever..) or a single global with locks (~8x the overhead vs tls).
now looking at the code that is run, let's examine:

data = _eo_table_data_get();

this is basically the tls lookup. check it out - but it's a tls lookup, if it fails, create new table and return that, so basically this always is just the first 2 lines of code:

Eo_Id_Data *data = eina_tls_get(_eo_table_data);
if (EINA_LIKELY(data != NULL)) return data;

it's already a static inline... so other than the error handling code getting inlined and maybe causing a cache miss when fetching the following code... i can't see much to fix here. all i can imagine is making a special case just for _eo_obj_pointer_get(). i tried custom inlining it with error handling at end of _eo_obj_pointer_get() to try move it out of the way - didn't change anything.

so then the rest of the code:

domain = (obj_id >> SHIFT_DOMAIN) & MASK_DOMAIN;
tdata = _eo_table_data_table_get(data, domain);
if (EINA_UNLIKELY(!tdata)) goto err;

a prefetch to try optimize and fetch data before needed (an extra instruction cost for the possible benefit of missing a stall). there is another prefetch there too for the same reasons. disabling those seems to get a little benefit:

4.75%  elementary_test                [.] _eo_obj_pointer_get

so pretty much no change. the bitshift + mask is unavoidable and i don't see how it can be done any better. if a compiler could use special instructions (mmx/sse etc.) to speed it up, it can, but even basic shift + mask instructions should be about as fast as it gets.

so we have another if for an error, an if when not a shared domain, an if to match cache... they are pretty much unavoidable. so all we have is:

tdata = _eo_table_data_table_get(data, domain);

and all that is:

return data->tables[domain];

so ... my conclusion is... there is basically nothing to optimize there. absolutely nothing to be done. we've prefetched as much as we can. we don't do anything we don't need to. at least not in the 64% of the hot path cases. does the cache then hurt? no. if i turn it off:

6.65%  elementary_test                [.] _eo_obj_pointer_get

so... cache -> good stuff (as almost always :)). so ... there are only 2 ways to go from here that i can see:

  1. optimize the non-cached path which is more complex. that's all the rest of the code in the if (domain not shared):
mid_table_id = (obj_id >> SHIFT_MID_TABLE_ID) & MASK_MID_TABLE_ID;
table_id = (obj_id >> SHIFT_TABLE_ID) & MASK_TABLE_ID;
entry_id = (obj_id >> SHIFT_ENTRY_ID) & MASK_ENTRY_ID;
generation = obj_id & MASK_GENERATIONS;

// get tag bit to check later down below - pipelining
tag_bit = (obj_id) & MASK_OBJ_TAG;
if (!obj_id) goto err_null;
else if (!tag_bit) goto err;

// Check the validity of the entry
if (tdata->eo_ids_tables[mid_table_id])
  {
     _Eo_Ids_Table *tab = TABLE_FROM_IDS;

     if (tab)
       {
          entry = &(tab->entries[entry_id]);
          if (entry->active && (entry->generation == generation))
            {
               // Cache the result of that lookup
               tdata->cache.object = entry->ptr;
               tdata->cache.id = obj_id;
               return entry->ptr;
            }
       }
  }

the masking and shifting is about as good as it'll get - like the masking/shifting above in the core path before the cache hits. again - mmx/sse/neon style simd could maybe speed it up.. but a compiler should take care of that these days... the if's after the masking/shifting fun are not really optional, and then we have possibly the biggest cost - looking up the eoid in the table. the chances are that it's cache misses that hurt here, but what is there we can do? don't have a mid table and sub table? this will lead to more memory usage in a single larger parent table (fragmentation etc.)... also having a different eoid allocation algorithm that puts object ids closer together might help... but it can only help the 36% of misses, and given things jump from ~5% to 6.6% without a cache at all, i imagine that we aren't going to make a dent here. the only thing i can see is flattening the table to a single level, which means no breaking up the table into blocks (mid table parent array then sub tables), and having to use things like madvise() to hand back chunks of a table that may now be empty/unused... and we need to encourage this to happen. again - my guess is that even if this is done, it'll be "much of a muchness". no gain or loss. just code changes. i may be wrong, and we would only know if it's tried, but i wouldn't invest my time here.

the other option is:

  1. look up objects less.

this is where i think there is more mileage to be had. for example: replace all move+resize calls with geometry_set calls. that halves the lookups for those. use the eo_add() ability to call functions at creation time. this should avoid any lookups as the id would be generated and the eo obj ptr re-used every time for every setup call. what we have is a lot of legacy code that is basically hurting us because it is legacy code. it's not taking advantage of the ability for eo to avoid the eoid overhead.
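The lookup-halving effect is easy to see with a toy resolver counter. Nothing below is the real EFL API - it just mimics "one id lookup per exported call" to show why a combined geometry_set-style call pays half the resolve cost of move+resize:

```c
#include <stdint.h>

typedef struct { int x, y, w, h; } Obj;

static int _resolve_count = 0;

/* Stand-in for _eo_obj_pointer_get(): every public call pays one lookup */
static Obj *
resolve(uint64_t id)
{
   static Obj obj;
   (void)id;
   _resolve_count++;
   return &obj;
}

static void
obj_move(uint64_t id, int x, int y)
{
   Obj *o = resolve(id);
   o->x = x; o->y = y;
}

static void
obj_resize(uint64_t id, int w, int h)
{
   Obj *o = resolve(id);
   o->w = w; o->h = h;
}

/* Combined call: same work, one lookup instead of two */
static void
obj_geometry_set(uint64_t id, int x, int y, int w, int h)
{
   Obj *o = resolve(id);
   o->x = x; o->y = y; o->w = w; o->h = h;
}
```

The same argument scales to efl_add()-time setup calls and to batching: every call that can reuse an already-resolved object pointer removes one table walk.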

so that's my take. i've profiled the eoid lookup a lot. also the call resolve too. i'm basically struggling to find ways of speeding these up. i'd especially sunk time into the eoid lookup and there are big reasons why i stick by "tls eoid table per thread" on mailing lists and push back - because i've invested a lot of time trying to make this stuff fast and good and my take is that it's the best solution for the problem overall.

now of course if you have some fresh/new ideas that i haven't covered here... please splat them out. i am not trying to shut down ideas. i'm trying to save time by pointing at what i've already spent time on and not to re-do that unless you truly have spotted something i have not. also be aware that every small change here will have differing results on different architectures. armv6 vs v7 vs v8 vs ix86 vs x86-64 vs multiple generations of different x86 cpus etc. etc. - my experience is to see subtle differences in this micro-optimization land, and that unless you get big wins that translate across architectures/cpu generations (or at least don't hurt some of them), it's probably not worth the pursuit beyond what has already been done. my take is the wins are higher up the stack like above.

maybe also combining more functions, like:


that's 2 eoid lookups there instead of 1. (in _item_position() in elm_genlist.c - it also has a move+resize there too as above). so for that function if we used the geom_set and had a combined thaw_eval we'd go from 6 lookups to 4. but of course this assumes that this is one of the major sources of eo lookups. there are probably more and finding the big ones and fixing them will help the most i think. and actually... i lied. that function is worse. it calls evas_object_evas_get() 3 times which is 3 more lookups. :) so those can be looked up once so go from 3 to 1 extra lookups, so from 9 to 5 lookups. having an efl_batch that is like efl_add that allows us to skip lookups for a batch of calls on an obj would be very helpful too. this was the original point of eo_do() - so you can batch things. that was lost along the way...

fyi - i quickly removed redundant calls in elm_genlist.c and the # of calls went from 33226752 to 33095680 ... that's all of 0.3% ... but fairly quick and simple to do with minimal dangers. i need to callgrind to see where most of the calls are coming from.

I got a bit more information on the "_event_callback_call" call.

We can make the internal datastructure for the event subscriptions an inarray.
Each element consists of { Event, Array of subscriptions }; this array is sorted by the event pointer.
Each element in "Array of subscriptions" is { callback, data, generation, priority }; this array is sorted by priority.

m = number of different events for which callbacks are subscribed
n = total number of subscriptions
i(x) = number of subscriptions for an event x

where m <= n and i(x) < n.

Fetching an event to look for callbacks is O(log m).
We can then tell in O(1) whether there is any callback at all.
Executing the callbacks in the end is O(i(x)), which is ... well, necessary.
So from this POV the event submission should scale way better than before, since before the execution is O(log n) in EVERY case.

I cannot tell how much faster it is until i have implemented it. However, I think this is a big tradeoff, since i am seeing a lot of walks over event subscription lists with 20 elements where only 1 is listening to a move event.
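A sketch of the proposed lookup path - bsearch over buckets sorted by event pointer, then a walk over only that event's subscribers. The structs, names, and the `_count_cb` helper are illustrative, not the real EFL types:

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct {
   void (*cb)(void *data);
   void  *data;
   int    generation;
   int    priority;
} Subscription;

typedef struct {
   const void   *event; /* event description pointer, the sort key */
   Subscription *subs;  /* sorted by priority */
   size_t        count;
} Event_Bucket;

static int
_bucket_cmp(const void *key, const void *elem)
{
   uintptr_t ev = (uintptr_t)key;
   uintptr_t be = (uintptr_t)((const Event_Bucket *)elem)->event;
   if (ev < be) return -1;
   if (ev > be) return 1;
   return 0;
}

/* O(log m) bucket lookup, then O(i(x)) dispatch; returns # of callbacks run */
static size_t
event_callback_call(const Event_Bucket *buckets, size_t m, const void *event)
{
   const Event_Bucket *b = bsearch(event, buckets, m, sizeof(*buckets), _bucket_cmp);
   if (!b) return 0; /* nobody subscribed: no walk at all */
   for (size_t i = 0; i < b->count; i++)
     b->subs[i].cb(b->subs[i].data);
   return b->count;
}

/* tiny test helper: sums the int data of every invocation */
static int _called = 0;
static void _count_cb(void *data) { _called += *(int *)data; }
```

The unsubscribed case is where this wins most: a move event with no listeners costs one binary search over m buckets instead of a walk over all n subscriptions.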

raster added a comment.EditedJul 24 2018, 2:00 AM

8% of eo calls are efl_ui_focus_object_prepare_logical_none_recursive() called from efl_ui_focus_object_prepare_logical() ... that's 8% of all eoid lookups... just by itself. :( focus manager is certainly something to look at. that's more than 2x as many calls as efl_event_callback_call ... :(

So from this POV the event submission should scale way better than before, since before the execution is O(log n) in EVERY case.

tbh, O complexity i think is going to be dwarfed by cache misses... :) i suspect designing a compact cache-friendly data struct is going to get the most wins...

but even then we have other issues like above - focus stuff seems to consume a huge amount of calls just for itself.. :( 2x as many as event calls..

I don't think so, since it's quite a difference if you walk an array with 2 elements or 20; this is not going to fit into a cache as a whole, and thus will still result in cache misses.

And additionally, i know about the focus performance problem, which is basically caused by this benchmark, as genlist creates/drops widgets a lot - which is not what usually happens while using a casual application. Or are you using your genlist applications by scrolling up and down for 10 sec.? :-D

Anyways, it's about to get fixed, the question is just at which point, since i don't really have time...

raster added a comment.EditedJul 24 2018, 9:43 PM

I don't think so, since it's quite a difference if you walk an array with 2 elements or 20; this is not going to fit into a cache as a whole, and thus will still result in cache misses.

look at the profiles above. the time callback_call takes is far less than _eo_obj_pointer_get. i'm looking at _eo_obj_pointer_get, and thus what really matters is the NUMBER of times it's called - and that depends on how many apis are passing in the obj and looking it up. so as _eo_obj_pointer_get takes up a lot more of the profile's time vs. calling callbacks, i'm looking at how to reduce the impact and i'm down to "call it less" as there is little else to do... and it seems focus stuff is the top caller of this. :)

i'm using the genlist bounce test:

echo 55000 > /proc/sys/kernel/perf_event_max_sample_rate;
perf record -F 55000 \
elementary_test -to genlist
perf report --no-children --stdio > perf-report.txt
chown raster.raster perf-report.txt

so 100 bounces (much more than 10sec - more like 50sec).

but it is an important benchmark... :) genlist is probably the most important list to have scroll fast. creation and destruction isn't going to go away. it's a necessity to be fast. the problem is the focus code was making the creation/destruction heavier. i'm currently just digging in, finding if there are any things to fix, and i fixed the most obvious already (excess efl calls in genlist like move+resize -> geometry_set etc.). but this dropped 0.3% of calls. 8% is in the focus registering above... it's the biggest chunk i have found by far (2x the next one down like above).

bu5hm4n removed a subscriber: bu5hm4n.EditedJul 24 2018, 10:56 PM

Okay i give up on you, i have now told you multiple times that I am going to fix this, I helped Derek disable focus stuff so you have an unbiased benchmark. You refuse to listen, so here you go, do your crap alone.

??? i'm trying to point out that callback_call is a far lower user of cpu time than resolving eoid's to pointers - at least in this benchmark and the benchmark does matter because genlist does matter. a lot. so far your comments have been basically "well it's genlist's fault for creating and destroying things and that's not realistic".

you have no input on what to do at all on that. no suggestions, other than saying that i refuse to listen ... listen to what? you have provided nothing to listen to on the topic other than dismissing it (you've talked about event callback calling but nothing on the large number of eoid lookups caused by focus registering).

no "i'm going to change the design to do x/y/z" or anything of the kind, or "maybe we should remove that call?" or ...?. if your response to someone helping out and profiling, examining calls etc. is basically to tell them to go away, which the above feels like (you don't listen. i give up on you, do your crap alone), then mental note: do not help out again. i could have gotten 5-6hrs of my day back.

No, my comments have been that it was my mistake that i did not account for genlist doing that so much, and i am fixing it.
I know that i made a mistake here. Sorry for that, but i am really tired of standing up each time repeating myself "yeah, i am going to fix it when i have time", from day to day to day, i know that the registration eats a lot of time, I KNOW IT. and i said that i will fix it. However not right now, not right here, due to the lack of time. So I helped derek disable focus stuff, so you have unbiased numbers; however, it's again and again and again dragged back to "but focus". Sorry for freaking out, but playing this game for multiple days really gets on my nerves. I am trying to help, but it feels like that help is getting undercut by every single reply in this ticket.

i know that the registration eats a lot of time, I KNOW IT. and i said that i will fix it. However not right now, not right here, due to the lack of time.

i have been trying to have you share information and knowledge so i can help fix it, but you've basically made me just stop caring and now not bother above. providing information/background helps. but your above comment has just told me to "piss off". you were talking about fixing callback_call and that's fine, but i was trying to point out that at least from an eoid lookup point of view, given it's one of the largest percentages, callbacks are a lower impact than some specific focus api's (if you followed what i was looking at and detailing it's the number of times it's called that really is what needs fixing rather than fixing the function itself). reducing the number of these calls can have a decent impact, but i wanted to have some background on them and perhaps suggestions on what to look into. at no point was i accusing you - i want you as the person who wrote the code to provide insight to save me time from figuring it all out myself the hard way.

In this case i probably misunderstood what you were saying. I thought you were talking about the genlist-focus problem, where i meanwhile provided a patch to derek so he can get unbiased numbers while i continue working on a complete fix. However, I still don't understand what you mean. What information about focus do you want to have?
The reason i was looking at callback_call is that it is a place where i see potential for optimization; i don't see it in the resolve calls, nor in the vtable stuff.

Lots to read here, only mostly skimmed it so far. Some points:

  1. My recent perf results have focus manager disabled to the point where it's doing precisely nothing. I did this (with bu5hm4n's help) because I feared people were expecting focus manager changes to be the silver bullet to resolving this, so would ignore the problem until bu5hm4n had a chance to resolve it. At this point it looks like bu5hm4n has a solid grasp on the focus manager problem - there's a bug ticket specifically for focus manager's heavy use of lists, T7049. Maybe we should broaden that ticket to include general performance impact of focus manager? In any event, I think we should at least give bu5hm4n a chance to work through the focus manager impact that he already has a firm grip on, while continuing on with other performance problems in this ticket.
  2. @raster, wtf is up with the zlib stuff in your perf trace? Are you running the same benchmark as the rest of us or have you rolled your own? Is this a theme related thing, or... ?
  3. These cache slot changes don't seem to be providing any improvement worth the effort? We need like 3000% improvements, not 1 to 2% here and there. Until the render ops are dominant by at least an order of magnitude I think we're still failing hard to efficiently use cpu/battery.

Honestly, there's way too much to read here now, and none of it really amounted to anything strongly productive? I think even if we could make all the eo related calls 10x faster, it won't be enough. We *also* need to remove tens of millions of calls (>50%, not 0.3%). Can we make other tickets for the small-time stuff and just link them here? Try to keep this as a jumping off point for solving the widespread major performance problems?

@ManMower - you should read what I wrote. There will be no making it 10x faster. I went into depth as to why that is not going to happen. There is nothing left to optimize there. I've tried caches with more slots. I've read every single line of code in the hot path. I've done cache hit analysis on single vs multiple slots, thus knowing how often the hot path gets triggered, etc. ... think it over. there is nothing to optimize, at least for getting the obj ptr from an id, in terms of the code/design as it stands in the cached path, and the cache hit rate is not bad; more slots don't improve the profile though they improve the hit rate a bit. i mentioned dropping a level of tables (and an extra ptr follow) to avoid more memory accesses but that then comes with other downsides. i've done all this with the conclusion that the only way to improve it is to drop the number of lookups, so the NUMBER of eo calls is what is important, and that is a slow painful path of finding all the redundancy where 2 or 3 calls are made where 1 could do.

in the process i found that 6% of the calls are just for adding to the focus manager as above, so this is an area of impact if it could drastically drop the amount of eo calls there. but it'll affect just 6% of the calls.

i gave some examples and did some work and all that led to 0.3% reduction in call count. there is no silver bullet. it's a lot of painful work to maybe cut 20-30% in the end as the 0.3%'s add up (and maybe that 6% becomes 3%). you can see the test i'm using above - it ships with elementary_test already. i pasted a small script that sets env vars, runs perf on the execution as a whole from start (so it'll include startup time). it is what i have used for profiling many many many times before. it's why i added the autobounce etc. etc.

it's not the same pasted code as in the ticket but it's representative of efl and genlist too. it may also be my flat theme changing the profile as it'll have fewer objects in the edje objects, but again... read my analysis and show me where there is room to optimize in the hot path e.g. for obj->ptr lookup. show me where i am so wrong by an order of magnitude. i have looked at this before many times and i thought i'd share the analysis to save you the time of re-doing it and re-learning the same things.

I hope I'm not out of line illustrating general performance problems here without first ensuring they're regressions...

my work in D7188..D7199 allows E to use hardware planes more efficiently, so in my test case (running elementary_test -to animation under E with the sw renderer), the animation client ends up on a hardware plane and no rendering operations are performed by enlightenment. It drops from about 60% CPU usage to 30% CPU usage.

However, that's 30% CPU usage when the render pass results in no drawing ops and no changes to any graphics buffers, so I thought where it's actually burning CPU time might be relevant.

 6.03%  [.] eina_array_step_set
 3.53%  [.] eina_array_flush
 2.40%  [.] _eo_obj_pointer_get
 2.28%  [.] eina_array_clean
 1.93%  [.] evas_render_updates_internal_loop
 1.81%  [.] _int_malloc
 1.66%  [.] _eet_free_reset
 1.58%  [.] _event_callback_call
 1.54%  [.] pthread_spin_lock
 1.53%  [.] _int_free
 1.35%  [.] evas_render_updates_internal
 1.27%  [.] eet_free_context_init
 0.99%  [.] clip_calc
 0.96%  [.] evas_object_smart_changed_get
 0.89%  [unknown]  [.] 0000000000000000
 0.88%  [.] inflate
I left out the functions below this point.

@ManMower does this happen every render cycle?? I'm curious where and (more importantly) why eina_array_step_set is constantly being called....

That's perf top results for just the E process. I'm surprised eina_array_step_set shows up that prominently since I didn't think it actually did very much...

Agreed, which is why I'd be curious where it's getting called from ... smells like a hotpath that could likely be optimized to (at the least) reduce calls to that function

This is interesting. I would like to see the call trace for most of this step_set. One guess is that this is happening in evas render while building the various object arrays. A long time ago, there was a cache to avoid rebuilding them during each loop, as these arrays really don't change, but because Evas_Map was hard to handle and make sure it wasn't impacting the array content, I didn't bother at the time as it was not a very common use case. So when an Evas_Map is present, the cache is disabled. That was ok 10 years ago. It might not be ok anymore today.

I've been looking at eet's contributions, as it was doing a lot of extra work at startup. Now that I've landed some eet fixes, here's the top few results I get from the perf report for:
ELM_FIRST_FRAME=E ELM_ENGINE=buffer perf record -i elementary_test

 7.94%  elementary_test  [.] _eo_obj_pointer_get
 7.04%  elementary_test  [.] _efl_object_call_resolve
 4.58%  elementary_test  [.] _vtable_func_get
 2.35%  elementary_test  [.] _efl_data_scope_get
 2.19%  elementary_test  [.] _event_callback_call
 2.08%  elementary_test  [.] efl_isa
 1.86%  elementary_test  [.] _eo_table_data_get
 1.72%  elementary_test  [.] _efl_object_call_end
 1.59%  elementary_test  [.] do_lookup_x
 1.51%  elementary_test  [.] evas_object_clip_recalc_do
 1.44%  elementary_test  [.] _edje_part_recalc
 1.26%  elementary_test  [.] _eo_obj_pointer_done
 1.17%  elementary_test  [.] _eo_class_pointer_get
 1.13%  elementary_test  [.] _efl_unref_internal
 1.11%  elementary_test  [.] _edje_object_message_signal_process_do
 1.11%  elementary_test  [.] _eo_callbacks_sorted_insert

Seems every time I knock something down, we're right back to being dominated by eo overhead. :(

indeed startup time is a whole kettle of fish on its own - good you looked. what was the profile before your changes though?

eet at least for me has needed a new approach. the decompress+decode into heap costs memory and cpu time. i really would like to come up with another data encoding we can mmap directly into ram and never "decode" but access directly. at least we can share memory e.g. for edje theme data structs then across all processes that use it. but this is a major undertaking, not just optimizing, so what you're doing is pretty good.

Looking at before and after perf reports, I guess perf isn't actually the best way to show the benefit of that work - for elementary_test, eet wasn't prevalent in the report, and for a simple hello world eet is still the most prevalent thing in the report, just differently. :)

I was happy with the improvements, ran perf to see what a good next step might be, and got the results above.

A trivial hello world run with the buffer engine still takes around 60ms to start here, and some large subset of that appears to still be eet - but I think we're in agreement that there don't appear to be many low hanging fruit left there.

elementary_tests's 450+ms startup probably has the same amount of eet time, but so much more other stuff.

I think at least we have a loose upper bound for the amount of startup time an eet redesign could possibly save. if it's a significant fraction of that 60ms it's definitely worthwhile.

@ManMower - good you looked. there could have been low hanging fruit, and i've found a very low hanging fruit in eet recently in the design of the dictionary. it totally does not scale across threads. it actually becomes insanely slower the more threads you have... so using edje_cc's threaded mode is a lot slower than without, because of this. it's a low hanging fruit to speed up edje_cc's threaded mode and any multi-threaded use of eet when writing, but i haven't had time to look at it more beyond identifying it.

my point i guess is that low hanging fruit do exist even in eet - just in certain code paths that are less trodden or cared for. in the common ones they likely don't exist anymore, but who knows... you may find something, so it's worth looking. i think we're at the point where it's "design it differently" to speed up the common stuff, and that's a whole new world of fun. :)

Looking at valgrind, I also see no reference to eet anymore. Just eo stuff like the above perf.

Looking at which functions in Eo cost, there seems to be a potential win if we were just not doing the safe pointer magic at all. When EFL is used in a binding, all the safe pointer infrastructure is actually a duplicated cost of what the binding is providing. Maybe it would make sense to make that part something we can just disable at initialization time, via an environment variable or a function call, and only keep it for C applications where it could still be necessary.

It's dangerous to remove for the following reasons:

  1. If internally efl does an eo call on an invalid object (happens often enough), the binding language doesn't save us.
  2. We use extra bits now in EOIDs for super calls and the ref tag, which have to stay there for eo to function.
  3. It won't help the API resolving - just the EOID lookup.
  4. It removes thread protection, so it opens the door for cross-thread object access/abuse that has been a source of many mysterious bugs before.

Overall I don't think removing it, even if it helps language bindings, is good, because there's a whole lot of accesses to eo API within EFL, and then those are unprotected and crashes there will be our fault if we removed the safety.

IMHO the way to go is to reduce the amount of eo API calls both within EFL and within APPS. Like I mentioned above - instead of move() + resize() make that geometry_set(). we just halved the EO cost there with no downsides to safety. I did a run over elm_genlist I think a while back doing just this. (note this is a trivial example and the EOID cache would optimize this as it's the same object 2 times in a row - it was for illustration purposes).

Other things might be to make the EOID tables more compact to have better cache coherency. To do this we could move to 32bit pointers for objects on 64bit systems. All objects live in the same memory block, thus only a 32bit offset from block start is needed (bitshift up for alignment and we can have 32GB or 64GB etc. of objects then, not just 4GB).

I just dropped about 27 commits into EFL which change the evas_object_move & resize calls into one evas_object_geometry_set call (where applicable). I did not go through apps and do it though.

  1. Yes, it is dangerous, but we have been living without it for years and we do tend to fix errors that are triggered by Eo in the Efl tree. At least, I haven't seen any in my use of upstream applications. Not saying it isn't there, but the risk vs cost is, I think, in favor of low risk for upstream efl internal bugs.
  2. That can be solved by aligning the pointer correctly and making sure that the lower bits are always free to use.
  3. That alone accounts for 15% of the benchmark under valgrind for the time to first frame, after Chris' patch that merges function calls and Derek's work on this too (from 17% to 15%).
  4. As I said above, this would be an option that can be turned off explicitly by bindings, because they already do all the checking themselves anyway. Also there is no binding that supports eo objects together with threads, and Eolian does not expose any information that would even enable bindings to support threads. Meaning there won't be thread support in bindings any time soon.

Overall I don't think there is anything that we can do that will shave 15% speed win by itself. The risk vs the benefit seems to be in favor of having this option I think.

I think you think EFL is in better shape than it is... :) If EOID is turned off then a lot of current "it's safe" paths become dangerous INTERNALLY in efl. if there is something i've learned, it's that 3rd parties will use efl in all sorts of bizarre ways we never would, and they find all the untested paths and then hit much bigger errors as a result. i think it's very unwise in return for a fairly small performance gain.

removing EOID will not remove 15% of eo cost. we still have to do all the function resolving. i suspect it will actually make much less difference in the end.

if you give people a gun to shoot themselves in their feet, they will do just that and then blame you for giving them the gun. think about it. the best path is to just have fewer eo calls.

if you look at my call reduction patch to genlist, i removed lots of silly stuff that was of the kind of:

do_eo_call(obj, eo_call_get(obj));
do_eo_call2(obj, eo_call_get(obj));

i made that 3 eo calls instead of 4 by just storing the result of the eo_call_get(). in other cases the value gotten was already available in some other way for "free". fixing this up would be a general win for efl with no safety downsides and as it's changing internal calls it benefits everyone, bindings or not. :)

you can try and make this a runtime thing - it'll raise the cost a bit to make it runtime optional, but then when EOID is turned off i think we're going to have more problems than it's worth, given the work involved to create and maintain this over time, and given it only improves the EOID lookup and not the function resolve cost, so gains will be less than you think.

perhaps first isolate the EOID lookup cost vs the rest (func resolve). my guess is the biggest cost of EOID lookup itself is actually just cache misses in looking up the EOID -> ptr, and i think this can be improved with better EOID locality plus fewer levels in the tables (there are 2 levels now - down to 1 level would remove an intermediate lookup).

ManMower removed ManMower as the assignee of this task.Feb 14 2019, 12:13 PM