Page MenuHomePhabricator

Memory Issue in edje_cache_emp_alloc() api.
Open, Incoming QueuePublic

Description

Here are the few issues I found while profiling.

We create Edje_Part_Collection_Directory_Entry for each collection ( button, label , border , bg etc).

This internally creates 33 different mempools (16 per normal and 16 for rtl)
The memory footprint for creating 33 mempool is around 7KB. (Note its only the mempool bookkeeping overhead mempol, backend etc).

If you look closely both RTL and normal creates same kind of objects( ex: for both normal and rtl we create Edje_Part_Description_Text object and so on)
So only using same mempool for both normal and rtl should be fine.

I don't know the reason of creating Edje_Part_Collection_Directory_Entry per collection instead of keeping a global Edje_Part_Collection_Directory_Entry which I think will be more memory efficient as same object of different collections will get allocated in one pool. In this case we have to change the mempool type from "one_big" to "chained_pool".

If you guys think this is a bug needs to be fixed. I will raise a patch for the same.

If there is a valid reason to keep those mempools separate per collection , then will close this issue.

smohanty created this task.Sep 1 2019, 10:12 PM

@raster , @cedric , @Hermet

If you guys knows the history of this code, please let me know your opinion.

raster added a comment.Sep 2 2019, 2:02 AM

I think @cedric did the mempool stuff. My original code would have just malloced everything. But I do know why it creates mempools per collection... So to free everything it just frees the whole mempool and not each item one by one, making the "free it" cost lower. This is kind of a work-around to eet originally not supporting arrays for encoding so everything was lists. This allows us to collect everything into one region of memory for this implicitly via the mempool (because mempools are the same-sized items, a bit like an array). As I mentioned before. I never expected out edj files to become so incredibly huge. I expected them to maybe have a few dozen part collections (group {}) or 100 at most. We've ended up at like 10-20x my "at most" expectation now by default. As I mentioned before... I'd do things differently if I got to re-do it but not even decoding but doing everything mmaped live from the file. for the directory entry of Edje_Part_Collection_Directory_Entry's I'd do something like:

// series of tables one after the other packed to the byte ...
struct __attribute__ ((packed)) {
  uchar type        : 2; // 0 == container with N children, 1 == leaf node
  uchar entry_size  : 2; // 0 == uchar, 1 == ushort, 2 == uint
  uchar string_size : 2; // 0 == uchar, 1 == ushort, 2 == uint
  uchar val_size    : 2; // 0 == uchar, 1 == ushort, 2 == uint
  
  if (type == 0) { <entry_size_type> count; /*count of entries of below*/ }
  struct __attribute__ ((packed)) {
    <string_size_type> string_id; // byte offset from start of string "blob" below for string
    <val_size_type>    value; // if type = container, byte offset from start of tables for child table
                              // else this is collection id number
  } entry[ 1 | count ]; // either just 1 if type is leaf node, or <count> entries as above
};
// ... next table etc. until no more tables. first "root" table above is enough to find all tables

// then all the strings needed for the above encoded in a dictionary like:
[string1][0][string2][0][string3][0]....[stringN][0]

The downside - custom designed encodings to be as packed as possible and thus custom code to encode/decode just this, as opposed to generic shared code in eet. So more code in edje to write, get right and maintain. Much harder to extend... But a lot more efficient. We can avoid compression as we already de-duplicated all strings in the group paths like: "elm/button/default", "elm/button/anchor", "elm/check/default" ... means we only encode a few sub-strings like "elm", "button", "default", "check", "anchor" and we assume / is the delimiter. The directory entries are sized fairly well as we can change size encoding from uchar to ushort to uint depending how much data we have to encode per field/level. It's encoded as a tree so finding entries is implicitly a tree-structure walk. At least in O() complexity it's not bad. Yes - we can argue cache friendliness - that's why all entries at the same level are packed so you pre-fetch a bunch of them per level (cacheline) so walking the entries per level should be fast (still need to indirect to string dict area anyway... we COULD add another indirection to a string ID table that then has the byte offset for the string, thus having smaller string_id sizes as it's string number not offset. That's up for debate and trial. I'd just mmap the table blob and string id blob from an eet file and keep it mmaped all the time so when we have to look things up we're walking the same memory mapped into multiple processes, thus not allocating this directory tree per process.

Now why say this? We can today modify edje_cc to *ADD* the above table and dict keys to an eet file in addition to the current encoded data. For NEW edje lib code that goes together with this, it looks for these tables FIRST and uses them if they exist, and if not, it falls back to the old method, thus remaining compatible. Old edje lib code that finds a new .edj encoded file doesn't know about the new keys and keeps using the old data. A middle-ground can be where new edje lib code can decode the old dir entries from an old file, re-encode them at load time like the above in ram, keep it packed (but in private per-process heap) then FREE all the directory entry structs it decoded and use this more packed form instead. Bonus points: Have the encoding done in a thread in the background so it won't stall the load (but will use up other cores perhaps - depending on scheduler deciding to put this on another core or not) then once it's encoded, atomically replace the structures and throw them out.

So that would be a possible path to improving it that doesn't break compatibility. The downside is the complexity of encoding (and decoding) it all as described above. It doesn't re-use code. It's not easily extended. That is why I have had at the back of my mind some plan to find some way of doing the above kind of encodings more generically so we can have highly packed data structures that are decoded on-the-fly as you access them without needing de-compression or even allocation of memory. Today I might actually go for making a code generator for this that spews out the encode/decode C code thus meaning you edit some kind of meta-struct description above that produces the real world structures and functions to do the work. Unless someone has a better plan?

Thanks for the detailed explanation after reading it got little bit more understanding of the current issue with edje memory consumption issue.

Now I understood that Edje_Part_Collection_Directory_Entry is also saved in the edje file. (luckily we don't store mempool data in the edje so if we want we can change it without breaking the Edje binary compatibility).

It seems edje_cache_emp_alloc() allocates some one_big mempools and also updates some global mempool variable which will be used by the eet_data_read internally to allocate part description objects.

Also I see some collections use around 40 part descriptions freeing them all instead of freeing the mempool may slow things down if we are doing that operation often.

Here is the part description size i got by putting some logs

RECTANGLE : 448 bytes
SWALLOW : 448 bytes
SPACER : 448 bytes
IMAGE : 600 bytes
TEXT : 680 bytes

The Objects size looks surprisingly big knowing that each part description in edc will corresponds to a description object in Edje.

Yet to see if there is any chance of reducing some size by structure packing or maybe it's already done.

Anyway will wait for @cedric to comment on this.

Currently elementary_test , creates around 15 different type of collection (button, label , etc) so overhead of keeping separate mempools is 15 * 7 = 105KB

So If deleting the descriptions when we clear the collection is not too much an overhead (yet to profile and see) then maybe we can move to global mempool as a temporary solution till @raster refactors the edje.

cedric added a comment.Sep 4 2019, 2:26 AM

You could check the structure packing, last time I did was maybe 10 years ago and as things grow organically, there might be some win there.

But switching from one big mempool to any other allocation might severely impact load and unload time on lower end hardware (to noticeable degree as that's why I did it). I am guessing your hunt for more memory would be solved by the mmap strategy that raster is pointing to, but that's a lot of work. A potentially easier win, if Tizen still use quicklaunch would be to load a set of edje file and edje object during the quicklaunch server initialization. This would then be shared in memory across all process and be a fixed amount of memory for everyone. This require zero change in edge and would give almost the same gain as the mmap strategy raster is thinking of (almost in the sense that mmap would give a perfect cache and share effect, while this one will be as good as the file list and object that get loaded optimistically). I would recommend to implement this solution as it will have a multiplier effect directly function of the number of application being run simultaneously.