
Slow event emission in eo
Open, HighPublic

Description

The event emission in eo is slow. We know that. Evas, for example, works around that by internally tracking whether someone subscribed to an event before emitting it. To get a number for how often we are emitting an event without anyone listening to it, I have created a hack in eo; you can find the branch at: https://git.enlightenment.org/core/efl.git/log/?h=devs/bu5hm4n/dead_events (I am calling these situations dead events.)
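
In rough terms, the hack does something like the following (a sketch of the idea only, not the actual code from that branch; name and legacy_is are real Efl_Event_Description fields, everything else is assumed):

// sketch: count handlers walked during emission and log when none matched
static void
_event_callback_call(Eo *obj, const Efl_Event_Description *desc, void *event_info)
{
   int walked = 0, called = 0;

   // ... walk the object's callback list: walked++ per node inspected,
   // called++ per handler that actually matched desc and was invoked ...

   if (!called)
     fprintf(stderr, "Emitting dead event %s %d %d\n",
             desc->name, desc->legacy_is ? 1 : 0, walked);
}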

The output looks like:

  Emitting dead event del 0 2

The first number (the 0) indicates whether the event is legacy (0 = not legacy, 1 = legacy).
del is the name of the event.
The second number (the 2) is the number of callbacks we walked.

The more callbacks we have walked, the more time we have spent in there. So maybe the ones with a high second number should be prioritized.
After a quick look over an example run with elm test:

(Let's reevaluate what we are doing with this once we have fixed the biggest ones.)

Details

Differential Revisions
D10493: evas: move watching over destruction of device to custom logic due to high use.
D10492: elementary: handle case when XFIXES is not available.
D10491: evas: deprecate evas_device_parent_set.
D10490: evas: move exposed internal structure of Efl_Input_Device to be private.
D10489: evas: mark Efl.Input.Device as @beta.
D10488: evas: move efl_input_device into evas/Efl_Canvas.h
D10487: edje: load edje seat callback only when necessary.
D10486: edje: improve callback count on Evas canvas.
D10485: improve debugging info
D10484: eo: add debug ability to detect long chain of event handler.
D10483: eo: reduce back memory limit in test.
D10482: eo: reduce memory use to track event registration and callback call.
D10481: ecore: remove custom code for generating Eina_Future_Schedule attached on Efl.Loop.
D10480: eo: add infrastructure to attach an Eina_Future_Scheduler to any source of event.
D10479: eina: only the type need to be NULL to assume EINA_VALUE_EMPTY.
D10478: fixup
D10477: ecore: remove unecessary code for Eina_Future scheduler.
D10476: ecore: remove unecessary field in data structure.
D10475: eo: prevent unecessary callback call walk.
Commits
D10354 / rEFLa7c2c91d7d0e: eo: block "invalidate" event emission when there are no subscribers
D10508 / rEFL2477dfaf5941: edje: selectively inhibit content,changed events when changing swallowed parts
D10361 / rEFLdc5c17a104c9: evas/render: selectively inhibit render callbacks
D10360 / rEFL7ad8acc290f0: efl_ui/layout: selectively inhibit theme,changed event
D10358 / rEFL78ea96af89b9: evas/smart: inhibit smart member add/del callbacks when no subscribers exist
D10357 / rEFLb1a2d82417e5: edje: block "recalc" event emission if it isn't being subscribed to
D10359 / rEFL8c6b26af015a: ecore: inhibit "idle" event emission if no subscribers exist
D10355 / rEFL3175690466bc: evas/smart: inhibit evas-internal smart callbacks when there are no subscribers
D10354 / rEFL8218fb0e2030: eo: block "invalidate" event emission when there are no subscribers
D10348 / rEFLb7432f690f86: efl_ui_focus_object: do not emit focus_geometry_changed when not needed
bu5hm4n created this task. Oct 9 2019, 10:52 AM
bu5hm4n triaged this task as High priority.

Now that we are making it a consistent pattern to only generate events when a handler has subscribed to them, maybe we should think of a nicer API to track the addition, removal, and count of handlers. The tricky bit is that we want to watch an object even before any of our constructor code is called. I am thinking that the way to go might be something like:

  • A function to count the number of event handlers registered for a specific event (easy to implement, and it should provide acceptable speed by just walking the handler array in the Eo object).
  • A function that will trigger a callback when that count goes up.
  • Another one that will trigger a callback when that count goes down.
  • Finally, a few macros to wrap all that into as little code as possible (one macro for the boolean in the structure, one for the callback function, and one for the constructor to initialize the tracking).

This should lead to 3 lines per event type per object to implement the tracking. What do you think of this proposal?
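
To make that concrete, here is a minimal sketch of what the three macros could look like (all names here - EVENT_TRACK_FIELD, EVENT_TRACK_CB, EVENT_TRACK_INIT, event_subscription_track, My_Object_Data - are hypothetical; this is not an existing EFL API):

// one boolean per tracked event in the object's private data
#define EVENT_TRACK_FIELD(name) Eina_Bool name##_has_subscribers;

// callback the (proposed) tracking infrastructure would invoke whenever
// the subscriber count for the event crosses zero in either direction
#define EVENT_TRACK_CB(name)                             \
static void                                              \
_##name##_track(void *data, Eina_Bool has_subscribers)   \
{                                                        \
   My_Object_Data *pd = data;                            \
   pd->name##_has_subscribers = has_subscribers;         \
}

// constructor side: register the tracker before any emission can happen
#define EVENT_TRACK_INIT(obj, pd, name, event) \
   event_subscription_track(obj, event, _##name##_track, pd)

// emission then becomes:
// if (pd->changed_has_subscribers)
//   efl_event_callback_call(obj, MY_EVENT_CHANGED, NULL);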

I was considering whether this could be something that eolian could generate somehow since it feels like we'd want it for all events?

wouldn't a nicer general solution be a cache of event ptr -> count of callbacks registered? keep the last N event types in this cache (have it be a small 4 or 8 slot cache) ... and we need the ability to remove the cache entirely on "inactive objects". the callback call can look in the cache; if the event is there and the count is 0, instantly return. if it's not in the cache, do a walk to try to call and, while walking, update the cache to have an entry. we need some timestamp/cycle count or something for the last time a callback was called on an object, so for objects that rarely produce events the caches can be freed...
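
as a sketch, such a per-object cache could look like this (all names illustrative, nothing here exists in eo today):

#define EVENT_CACHE_SLOTS 4

typedef struct {
   const Efl_Event_Description *desc;  // event type cached in this slot
   unsigned short               count; // callbacks currently registered for it
} Event_Cache_Slot;

typedef struct {
   Event_Cache_Slot slots[EVENT_CACHE_SLOTS];
   unsigned int     last_emit; // timestamp/cycle of the last callback call,
                               // so an idle sweep can free caches of idle objects
} Event_Cache;

// callback_call: if desc is found in a slot with count == 0, return
// immediately; on a miss, walk the handlers and fill a slot as we go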

we need some way to track objects that have event caches so they can be cleared out later - that, or we literally do some GC-like sweep walking all objects in an eoid table looking for these and nuking them. it might be handy to have some idle walker that walks around objects in the table, with the idea of idle-time sweeps that call some kind of "gc" function that evaluates the obj to see if it needs cleaning/compressing etc. this sweep can remember what it has walked so far and walk un-walked objects until it has finished a sweep, then reset and start again.

the mainloop can set a sweep to begin every N seconds and then do it in an idler (do only a few objects (~20-30?) per idler run so it returns quickly and can be preempted by a loop wakeup)... this would be super useful for any kind of caching, compressing etc., and cleaning out an event ptr -> count cache is a good first such use...

@raster the problem we have with a lot of events, like invalidate, is that we trigger them once, but on every object. A cache per object wouldn't be very helpful there, as it would likely increase that single call's cost without offsetting it with any win. It could help for repetitive events, but I am afraid the increased single-shot call cost would be a problem. Also, it would not remove the cost of the eo call for nothing.

We could maybe add another boolean on events to mark the ones that are single shot and bypass the cache in their case, in which case we would use the infrastructure I just proposed above. Need to think about this cache a bit more.

but we call other events a lot, like mouse move or move,resize and geom change events. a cache would only cost an extra cmp+branch if empty before walking the events ... if it is a few slots it's a small walk of a short array of 4 or 8 items. in your case of "not so frequent invalidates but invalidating everything", the invalidate should find most objects with empty caches, so it's just the extra cmp+branch for the cache ptr check, which should sit next to the events so it's in the same cacheline. yes, it's a cost. but will this cost give us a win in all the other cases, without having to hand-write special event counting and maintain a bunch of event counters?

another idea might be to just have a 32bit int event mask. every bit is a "bucket". we hash the event desc ptr down to a 5 bit (0-31) value, and the event lives in that bucket. whenever we add or del events we invalidate the event mask (set it to all 0). if the event mask is 0, then the next time we walk the event list we hash every event ptr and fill in the event mask (at least 1 bit will be set if there is at least 1 event cb).

so we know that if a bit is 0 there are ZERO callbacks for anything that maps to that bit. it's not perfect, but it's low cost to maintain, low memory (just an extra 4 bytes), and it gives a "definitely no callbacks for this type" or a "there might be a callback for this type" answer just by hashing the event desc ptr and checking the matching bit. it doesn't need the gc-like sweeps of the eoid table.
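
the maintenance side would be something like this (a sketch; field names assumed):

// on any callback add/del: invalidate the mask
pd->event_mask = 0;

// lazily rebuild at the start of the next event list walk
if (!pd->event_mask)
  {
     unsigned int mask = 0, i;
     for (i = 0; i < pd->callbacks_count; i++)
       mask |= 1u << efl_event_description_hash(pd->callbacks[i]->desc);
     pd->event_mask = mask;
     // note: with zero callbacks the mask stays 0, which the check below
     // treats conservatively as "unknown" - a harmless corner case
  }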

how about that? since everyone inherits from the base class, accessing the event mask is a scope_data_get of the base class and then just reading that int. a static inline that does this hashes the event ptr to the right bit, checks it and returns true/false, so we do:

static inline Eina_Bool
efl_event_likely(Efl_Object *obj, const Efl_Event_Description *desc)
{
   Efl_Object_Data *pd = efl_data_scope_get(obj, EFL_OBJECT_CLASS);
   unsigned char hash = efl_event_description_hash(desc); // hash desc down to a 5bit value
   if (!pd->event_mask) return EINA_TRUE; // mask is invalid so be conservative and assume true
   if ((1u << hash) & pd->event_mask) return EINA_TRUE; // bit is set - there might be a callback there
   return EINA_FALSE; // bit is 0 but mask has something in it - definitely no callbacks there
}
...
if (efl_event_likely(obj, desc)) efl_event_callback_call(obj, desc, info);

thinking along these lines. cheap in terms of mem usage. it is a scope data get + hashing effort... but i'm hoping on average this helps rather than hurts.

That second idea actually sounds like something that could work in almost all cases, maybe even for invalidate. We might still want to hand-optimize some cases later, but overall this should definitely help. We just need to hash into a 32 bit or 64 bit mask. It sounds fairly easy to try, with low memory impact and not much speed impact, as we just need to go from a pointer to 5 or 6 bits with a light hash function. We will spend some time next week trying this.
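
For illustration, one possible light hash (hypothetical - the thread doesn't pin down the actual function):

#include <stdint.h>

// fold an event description pointer down to 5 bits (0-31)
static inline unsigned char
efl_event_description_hash(const Efl_Event_Description *desc)
{
   uintptr_t p = (uintptr_t)desc;
   p >>= 4; // the low bits are mostly alignment and carry little entropy
   return (unsigned char)((p ^ (p >> 5) ^ (p >> 10)) & 0x1f);
}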

OH don't get me wrong - hand optimizing some special cases where it really makes sense would be good, but an "optimize most cases to a reasonable degree on average" approach will cut down the work.

we could use either 32 or 64 buckets based on ptr size, so that the event mask can live in an aligned spot right next to the callbacks ptr - or always be 32bit and live next to the callbacks_count.

so you think that even though it's not totally accurate like full per-event desc counters, it's probably going to work ok? this is where the question is - will the bit of extra work be worth it given the hit/miss rate of the predictions (it'll always be somewhat lossy and mis-predict, but the more bits we have and the better the hash func, the better the hit rate will be in avoiding calling callbacks when no one is listening)? this may actually be able to remove a lot of the special case code we have now, if it proves good enough?

the only thing that concerns me is the cost of the data_scope_get() here. the hashing will be fairly cheap done as a static inline. the access to event_mask will maybe go down some layers of cache or out to memory and stall... but we'd need to access that memory to call the callbacks anyway, so I only see this as bringing forward a cost we'd pay anyway without this event_likely(). ensuring pd->callbacks and this mask share the same cacheline is probably important. the rest (bitshift + bit testing + branch) is really nothing in the scheme of this...

so does that scope data get hurt us more than all of this helps... we don't know until we try i guess? no gut feelings?

It seems hard to know whether it helps until we try. The theory sounds good and might help. I will also put some statistics in place to see how many handler callbacks are registered, how often we hit the callback_call function and so on, and to check that the distribution of the hash is ok. Then we will see whether it helps in the expedite and elementary benchmarks. I will let people know how this goes next week.

Additional information to think about: the invalidate callback, when used, usually means a massive number of handlers for the invalidate event on the same object, mostly driven by clipper objects. I guess this is a case that would be a good idea to optimize for, and none of the above propositions will help here. I am thinking of trying to group handlers in a list/array when the same event has tons of handlers registered on it. I will try to implement that with append-only logic, to avoid affecting the order in which callbacks are called and to also reduce the handler insertion cost; something like the sketch below.
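
A sketch of that grouping (illustrative only; Efl_Event_Cb is the real callback type, the rest is assumed):

// one node per (event, func, priority) with an append-only array of
// data pointers, so the relative call order inside a group is preserved
typedef struct {
   const Efl_Event_Description *desc;
   Efl_Event_Cb                 func;
   short                        priority;
   unsigned int                 data_count;
   void                       **data; // only ever appended to
} Grouped_Callbacks;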

I have been playing locally with a few different patches.

The hash trick does kind of work and doesn't... Basically, with our current level of cut-off, we gain nothing (and lose nothing) from running it. I will still propose a patch, as there might be workloads where it helps.

I haven't been able to create a generic infrastructure for detecting callback add/del of specific event handlers that satisfies me without increasing memory or cpu usage. I will keep thinking about this one.

Finally, I started to work on some optimization for the case where callbacks are piled on the same object with the same value, and found out:

  • The callback added at evas_object_main.c:L178 is registered thousands of times on one single object, and the count pretty much never goes down. It seems like every object in the canvas is running that code. This is clearly one of the worst offenders possible. I do not have much understanding of this code and will dig into it tomorrow to see if we can just get rid of that callback handler.
  • There are a bunch of other unexpected offenders from edje_load.c:L664, which reach counts in the hundreds on the main canvas object. I also think we could shut that one off at the source instead of optimizing for it. Need to investigate that code tomorrow too.
  • The clipper invalidate handler will be the last one that requires reflection. I am starting to wonder if that shouldn't just be part of the invalidate code of all evas objects, which would eliminate the need to optimize for lists of events with the same func/desc/priority.

I might add the logic for detecting this bad behavior in the debug build.

cedric added a revision: D10478: fixup.

There are a bunch of other unexpected offenders from edje_load.c:L664

Most interesting... So every edje object will be adding maybe 3 or even 4 callbacks to the parent canvas object. Given a few hundred edje objects (buttons, icons, etc.), this is a LOT of stuff in a single event callback list. But looking at what it does, I am not sure there is a way around this: the object needs to know when these things happen.

So first... I imagine that this list manipulation is going to add quite a lot of cost of its own to object creation and destruction, especially since removing the callbacks means walking all callbacks to find a matching one to remove... and that walk is not cheap (see below).

Second - any event callback called on the canvas is going to be quite costly, as it walks this list of 1000, 2000 or more callback nodes trying to figure out which to call or not. Even as a callback array this is not going to be good. In fact it'd be worse, because it would remove paths for optimization.

I think we need to revisit the callback storage per object. Right now it's an array of callback desc pointers. Each desc contains the event desc ptr, func ptr, data ptr, priority (and a generation counter). I'm ignoring cb arrays right now, but they fit into the same memory blob and just add another indirection. So:

  1. This will be indirecting to fragmented memory to walk every callback desc. This means we're cache-missing almost every time, I would guess (at least an L1 miss) per item we walk.
     Each item is 32 bytes (24 on 32bit, if I account for alignment). Chances are each item will be somewhere else in memory, so they won't share a cacheline.
  2. There will be a lot of repeated data. The event desc ptr will be stored again and again and again. In fact, in the above example we will be storing the same function pointers again and again and again... This can be compressed/reduced for sure.
  3. Callback arrays might cut down the number of nodes to 1/3 or 1/4 as many, but it's still in the hundreds, it still all has to be walked, AND there is an extra indirection to walk the array in each cb desc.

I think we need to actually avoid callback arrays. Especially with this kind of usage. They make it harder to optimize. What I think we need to do is... flexible memory format/layout. We effectively compress memory by finding known patterns and having custom memory walking. For example:

typedef struct {
  unsigned int encoding : 8;  // how the callbacks for this description are encoded
  unsigned int offset   : 24; // offset in 4 byte chunks (allowing up to 64M of callback data as a result)
} Efl_Event_Callback_Entry_Info;

struct {
  // ...
  unsigned short                 desc_count;         // number of elements in the two arrays below
  Efl_Event_Description        **desc_entry;         // ptr to a variable sized array of desc_count event description ptrs
  Efl_Event_Callback_Entry_Info *desc_entry_info;    // desc_count entries matching desc_entry 1:1 - how the callbacks are stored
  void                          *desc_callback_data; // chunk of memory containing all the callback data, encoded per chunk
  // ...
}; // callback event info in base class - just stripping out this section for illustration

typedef struct {
  void *funcptr;
  void *data;
} Encoding1; // single func + data - assumed priority 0, thus not encoded - single slot only

typedef struct {
  unsigned short callback_count; // number of entries in the below array
  // obviously alignment padding here...
  struct {
    void *funcptr;
    void *data;
  } callbacks[];
} Encoding2; // multiple callbacks + data, all assumed to be priority 0

typedef struct {
  unsigned short callback_count; // number of entries in the below array
  // obviously alignment padding here...
  struct {
    unsigned short priority; // priority of this callback entry
    void *funcptr;
    void *data;
  } callbacks[];
} Encoding3; // multiple callbacks + data where we may have callbacks of differing priority

typedef struct {
  unsigned short data_count;
  void *funcptr;
  void *data[];
} Encoding4; // single func, called N times, once per data ptr, all assumed priority 0

// ... we can add more encodings over time

So think of the callback data as more of a file format that may vary its layout in memory (like on disk), and try to have as little repetition as possible. For the above edje load case, Encoding4 will be perfect. Yes, we need logic to detect which encoding to use. It should be possible to have some "optimize" or "compress" pass that re-encodes event callbacks after N calls to callbacks.
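
As a sketch, the emission side would dispatch on the encoding field, e.g. (illustrative only, building on the structs above; the function-pointer cast stands in for whatever the real call helper would be):

typedef void (*Event_Cb)(void *data, void *event_info);

static void
_call_encoded(const void *blob, Efl_Event_Callback_Entry_Info info,
              void *event_info)
{
   const char *base = (const char *)blob + info.offset * 4;

   switch (info.encoding)
     {
      case 1: { // Encoding1: single func + data, priority 0
         const Encoding1 *e = (const Encoding1 *)base;
         ((Event_Cb)e->funcptr)(e->data, event_info);
         break;
      }
      case 4: { // Encoding4: one func called once per data ptr
         const Encoding4 *e = (const Encoding4 *)base;
         unsigned short i;
         for (i = 0; i < e->data_count; i++)
           ((Event_Cb)e->funcptr)(e->data[i], event_info);
         break;
      }
      // ... one case per encoding added over time
     }
}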

Now here is an idea that just popped into my head. We have stringshare for ... strings. We should have funcshare and funcptrshare. Certainly for each object we could provide more encodings like:

typedef struct {
  unsigned char data_count;
  unsigned char funcindex;
  unsigned char dataindex[];
} Encoding5; // single func, called N times, once per data ptr, all assumed priority 0, as byte indexes

typedef struct {
  unsigned short data_count;
  unsigned short funcindex;
  unsigned short dataindex[];
} Encoding6; // same as Encoding5 but with short indexes
...
  void *functable[]; // the above funcindex indexes into this func table
  void *datatable[]; // the above dataindex values index into this data table

When the same func ptrs and data ptrs appear again and again, but we don't have too many different ones, we could have encodings with just a byte or a short for the index. The less memory we walk, the fewer cache misses we will have (and the less memory we use... think of this as a data-specific form of compression at runtime). Yes - there is an indirection to the func and data tables. The idea would be that the above optimize() pass figures out whether this is a win or a loss: how much memory do we save (and thus cache misses) vs the misses added by the indirections? At some point it will be more efficient to encode like the above. It also has to choose between char or short encoding depending on how many funcptrs and dataptrs exist.
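
To put illustrative numbers on it (made up, just to show the order of magnitude): 100 registrations sharing 4 distinct funcs and 10 distinct data pointers cost 100 × 16 B = 1600 B as raw (func, data) pairs on 64bit. With byte indexes, that is roughly 100 × 1 B of data indexes, a few bytes of per-entry headers, plus (4 + 10) × 8 B = 112 B of shared tables - around 220 B in total, so the walk touches about 7x less memory.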

Now the question might be: this is all nice if we can optimize and encode only once and then never modify it again... it almost certainly can be faster. We need to balance this against continual modification: every edje object that is deleted needs to remove its callbacks, and every creation adds callbacks too, which may alter the encoding type needed, etc.

I'm thinking out loud here. Perhaps some simpler subset of the above ideas might end up best? Like just a few simple encodings such as Encoding4, which is definitely going to save a lot of space and thus make walking far more efficient, with fewer misses, for the above edje load case. It is also easy to modify (add and delete callbacks) without expensive re-encodings, and it can encode just specific event description types this way and not others.

I totally agree that we should revisit the eo event array. I have another idea that I just wanted to share:

Right now we have one big single array that contains *all* subscriptions to events. @cedric already implemented your idea of a "hash" of an event desc, which can be used to check whether there is any subscription at all. How about utilizing this idea even more: we could hash the hash a little bit further so we only get 4 bits out of it, which means (with a good distribution of the hash) that we have 16 different possible hash values. We could use them as a sort of "bucket" identifier, meaning we would have (worst case) 16 different "buckets" in which we store subscriptions. That splits our one big array into multiple arrays, which gives us the chance that even enormous event subscriptions (1000 times the same event on the same object) will not impact another event that has a different hash value. A rough sketch:
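
(All names below - Callback_Array, _bucket_of - are hypothetical, just to illustrate the idea:)

#define N_BUCKETS 16

typedef struct {
   Callback_Array buckets[N_BUCKETS]; // one subscription array per bucket
} Bucketed_Callbacks;

static inline unsigned char
_bucket_of(const Efl_Event_Description *desc)
{
   // reuse the existing 5 bit description hash, folded down to 4 bits
   return efl_event_description_hash(desc) & 0x0f;
}

// emission then only walks buckets[_bucket_of(desc)], so 1000
// subscriptions to one event don't slow down events in other buckets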

Thoughts on this?

After spending a week reviewing our usage, I think anything above 16 registered callbacks is an indication of buggy/suboptimal code that can be fixed at the source. Once my series of patches lands, this will be fixed. I don't think additional optimization is necessary at this point. There might be some more places where callback arrays pay off, but I haven't looked at it in detail yet (don't forget that callback arrays provide compression across multiple instances).

After spending a week reviewing our usage, I think anything above 16 registered callbacks is an indication of buggy/suboptimal code that can be fixed at the source.

What's the solution to the edje_load case above, where every edje obj is listening for device changes? This kind of pattern would be common enough, where objects want to know about some change on some small number of "master" objects?

For now I detect at runtime when a part of a signal needs multi-seat information, and only then turn on those callbacks. It could be improved by having that detection embedded in the file format. I do think most edje files won't have seat customization in the end, so for now it isn't a very general case. If I filter for callback lists above 16 entries, I find no other culprit at the moment (with a hash of 32 entries, 16 seemed to be a good cut-off point).

@cedric - ok, cool, but what about cases where we literally do have to have a lot of objects subscribe to some event on a shared object (e.g. a parent or toplevel)? the mainloop obj is going to be a nasty place for this kind of stuff... :(

I would wait for this to be a problem, as I am not sure which patterns need to be optimized. If I had optimized just for our bad case, making func_data a NULL-terminated array would have been a slick option that solves the problem. I am not sure it would help in a more general manner, which is why I would just wait for now, unless you have some app that does have this problem.