During the development of EFL 1.8, memory usage has increased along with the new features. Here I went on a rampage with my new CoW.
Since September, EFL 1.8 has been under heavy development. The Samsung Israel team added Eo (a new object model that should help unify all EFL objects, and should have its own blog entry), and Profusion (now part of Intel) added Evas async rendering and Ephysics. Of course a bunch of other smaller changes went in, and overall the memory used in our tests grew quite a lot, from 5.4MB to 8MB. That rang a bell, and I started looking into what went wrong.
If you are already bored, the end result is that we are now back at 5.6MB and we should be able to gain another 300K to 400K before the release (something planned for April/May at this point). So now that you feel better, you can go back to scratching your under-arm hair, or spend some time reading the rest of this blog.
The first thing to do when you are optimizing is to compile a set of tests that will work for all revisions and give you numbers that you can trust throughout your development. I chose some elementary_test cases that look exactly the same in 1.7 and in our development branch (which is 1.8 in-the-making). From there I used valgrind massif and massif-visualizer. If you don't know about those, spend some time playing with them and learn how awesome they are!
And here were the winners (in terms of biggest memory footprints):
- 3MB of Eo objects
- 1.2MB of Evas_Object_Image
- 637KB of Evas_Object_Rectangle
- 465KB of Evas_Object_Text
- 345KB of Edje_Object
- 1.9MB of mempool
- 814KB of Edje Part
- 439KB of Eina_List for Eo type
- 350KB of Evas_Object callbacks
- 100KB of image draw command
- 463KB of Edje matching automaton
- 370KB of image pixels data
The first thing that struck me was: what is this 439KB of memory for a stupid list of static strings? In fact, it was a shortcut taken during the development of Eo: instead of being part of the class information, the list was put in each object. Of course that was a bug, and it was quickly fixed after @tasn spanked @JackDanielZ.
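The fix described above boils down to moving data that is identical for every instance out of the object and into its class. Here is a minimal sketch in plain C. All the `Demo_*` names, the array-based type list, and the type strings are assumptions made up for illustration; this is not Eo's real layout.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for illustration only; this is not Eo's real layout. */

/* Class-level data: allocated once and shared by every instance. */
typedef struct {
    const char *name;
    const char *const *type_names; /* static strings, identical for all objects */
    size_t type_count;
} Demo_Class;

/* Before: every object carries its own copy of the type list. */
typedef struct {
    const Demo_Class *klass;
    const char *type_names[4];
    int x, y;
} Demo_Object_Fat;

/* After: the object only points at its class. */
typedef struct {
    const Demo_Class *klass;
    int x, y;
} Demo_Object_Slim;

/* Illustrative type strings, not taken from the real class hierarchy. */
static const char *const demo_rect_types[] = { "base", "object", "rectangle" };
static const Demo_Class demo_rect_class = { "rectangle", demo_rect_types, 3 };

/* Bytes saved per object by keeping the list in the class. */
size_t demo_saved_per_object(void)
{
    return sizeof(Demo_Object_Fat) - sizeof(Demo_Object_Slim);
}

/* Type lookup now goes through the class instead of the object. */
const char *demo_type_name(const Demo_Object_Slim *o, size_t i)
{
    assert(o && o->klass);
    return i < o->klass->type_count ? o->klass->type_names[i] : NULL;
}
```

The saving per object is small, but it is multiplied by the number of live objects, which is how a list of static strings can end up costing hundreds of KB.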
So what went so wrong? Why are all those objects that big? The reason is quite simple. We added more and more flags and features. The objects then grew in total size, even if most of these added values are never changed or only changed in some rare cases. Of course @raster came up with the crazy idea of compressing objects on the fly in memory and decompressing them again... but instead of being crazy, what if we didn't duplicate memory for nothing in the first place? Thinking about it, most of those objects have exactly the same values, and even better, they never change them.
So after a quick round of grouping data in their structures when they are used together, I came up with the idea of doing a kind of "copy on write" infrastructure. The idea being that most of our access patterns are reads and very few are writes, especially in the hot path. After a few rounds of design, Eina_Cow was born. It then started to roll into Evas. The result is that now we have:
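To make the copy-on-write idea concrete, here is a miniature version in plain C. It is only a sketch under my own assumptions: the `Demo_*` names and the reference counting scheme are hypothetical, and this is not Eina_Cow's actual API or implementation. Every object starts out pointing at one shared default block, reads go straight through the pointer, and a private copy is made only the first time an object writes.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical miniature copy-on-write helper; NOT Eina_Cow's real API. */

/* Rarely changed state that most objects never touch. */
typedef struct {
    int refcount;      /* number of holders of this block */
    int border[4];
    double scale;
    unsigned char smooth;
} Demo_State;

/* One default block; refcount starts at 1 so it is never freed. */
static Demo_State demo_default = { 1, { 0, 0, 0, 0 }, 1.0, 1 };

typedef struct {
    Demo_State *state; /* read directly on the hot path */
} Demo_Object;

void demo_object_init(Demo_Object *o)
{
    o->state = &demo_default;
    demo_default.refcount++;
}

/* Return a block that is safe to modify, copying only if it is shared. */
Demo_State *demo_state_write(Demo_Object *o)
{
    Demo_State *copy;

    assert(o && o->state);
    if (o->state->refcount == 1 && o->state != &demo_default)
        return o->state;          /* sole owner: write in place */

    copy = malloc(sizeof(*copy));
    if (!copy) return NULL;
    memcpy(copy, o->state, sizeof(*copy));
    copy->refcount = 1;
    o->state->refcount--;         /* drop our reference to the shared block */
    o->state = copy;
    return copy;
}

void demo_object_fini(Demo_Object *o)
{
    if (--o->state->refcount == 0 && o->state != &demo_default)
        free(o->state);
    o->state = NULL;
}
```

Because `state` is never NULL, the reading code keeps dereferencing it exactly as before; only the writers have to go through `demo_state_write()`.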
- 400K of Evas_Object_Image
- 300K of Evas_Object_Rectangle
- 289K of Evas_Object_Text
- 228K of Edje_Object
And 479K of modified data. The next stage would be to run some memory comparison functions during idle time to merge modified data back together where duplicates within the modified section are found. I also didn't pay attention to Evas_Object_Text and Edje_Object. Their size went down just because Evas_Object sizes were reduced. That first improvement gave us back 1MB of memory.
What I learned when rolling in Eina_Cow is that the only source of bugs is code that uses the stack or memcpy()s parent structures to duplicate references without instructing Eina_Cow about it. The rest of the changes are pretty straightforward and easy to do. The interesting part is that most of the code logic stays untouched, and there is no need to add tests for NULL all over the place.
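The memcpy() pitfall is easy to reproduce with any reference-counted shared block. The sketch below uses hypothetical `Demo_*` types rather than Eina_Cow itself: a raw copy of the parent structure duplicates the pointer without anybody being told, so the block still believes it has a single owner.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical reference-counted shared block; NOT Eina_Cow's real API. */
typedef struct { int refcount; int value; } Demo_Shared;
typedef struct { Demo_Shared *shared; int id; } Demo_Parent;

/* Wrong: the pointer is duplicated but the shared block is never told. */
void demo_copy_raw(Demo_Parent *dst, const Demo_Parent *src)
{
    assert(dst && src);
    memcpy(dst, src, sizeof(*dst));
}

/* Right: copy the plain fields, then register the new reference. */
void demo_copy_tracked(Demo_Parent *dst, const Demo_Parent *src)
{
    assert(dst && src && src->shared);
    *dst = *src;
    dst->shared->refcount++;
}
```

After `demo_copy_raw()`, two parents hold the block while its count still says one, so a write through either of them would happen in place and leak into the other, and a free would leave a dangling pointer.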
Logically the next place to get the Cow treatment was Edje Parts. In these, we use a cache of previous calculations to avoid a lot of extra work. That cache was pretty big. It includes values for Evas_Map or Ephysics even if most objects don't use them. That was about 400 bytes per part per Edje object. By using Eina_Cow we got that one out and now Edje Parts use 464K.
I realized that Edje was duplicating much more data than needed. First, a small bug was duplicating the program string match per object when it really should have been per class of object. That was small, but still 100K. The big one was Edje signal callbacks. Elementary always sets the same triplet of signal, source and function, which doesn't change, plus a data pointer that does change. We had lazily implemented one string match per object even though the match was always the same, as it only matched based on signal and source. I decided to implement the full logic to avoid duplicating those matches by detecting when the callbacks were the same. This was a difficult task, as we do a lot of registering/unregistering of callbacks and have many optimized paths. But I managed to do it and saved another 463KB.
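The de-duplication of signal callbacks can be pictured like this: the (signal, source, function) triplet lives once in a shared table, and each object only stores a pointer to it plus its own data. This is a simplified sketch with hypothetical names and a fixed-size linear table; Edje's real internals and its matching code are different.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch; names and layout are not Edje's real internals. */
typedef void (*Demo_Signal_Cb)(void *data, const char *signal, const char *source);

/* Shared part: identical for every object of the same kind. */
typedef struct {
    const char *signal;   /* assumed to outlive the table (e.g. literals) */
    const char *source;
    Demo_Signal_Cb func;
    int refcount;
} Demo_Match;

/* Per-object part: just a pointer to the shared match plus the data. */
typedef struct {
    const Demo_Match *match;
    void *data;
} Demo_Callback;

#define DEMO_MAX_MATCHES 64
static Demo_Match demo_matches[DEMO_MAX_MATCHES];
static size_t demo_match_count;

/* Return the shared entry for a triplet, creating it on first use. */
Demo_Match *demo_match_get(const char *signal, const char *source, Demo_Signal_Cb func)
{
    size_t i;

    assert(signal && source && func);
    for (i = 0; i < demo_match_count; i++) {
        Demo_Match *m = &demo_matches[i];
        if (m->func == func && !strcmp(m->signal, signal) && !strcmp(m->source, source)) {
            m->refcount++;
            return m;
        }
    }
    if (demo_match_count == DEMO_MAX_MATCHES) return NULL; /* table full */
    demo_matches[demo_match_count] = (Demo_Match){ signal, source, func, 1 };
    return &demo_matches[demo_match_count++];
}

size_t demo_match_total(void) { return demo_match_count; }

/* Placeholder callback so the sketch can be exercised. */
void demo_noop_cb(void *data, const char *signal, const char *source)
{
    (void)data; (void)signal; (void)source;
}
```

Two objects registering `"clicked"` / `"button"` with the same function end up sharing one `Demo_Match`; only the `data` pointer differs per object. The hard part mentioned above, keeping this correct through constant registering and unregistering, is not covered by the sketch.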
So our savings in summary are:
| Item | Before | After |
|---|---|---|
| Eo type strings | 439K | 0K |
At this point, we are almost back to where we were before with 5.6MB, but with a lot of new cool features in. Looking at what is left, there is some hope to get rid of a big chunk of the memory allocated for Evas_Object callbacks, and the image pixels data should also be shared with other processes thanks to Evas Cserve2 and Profusion's work (that topic alone would really be worth another post). The idea for Evas_Object callbacks is also to de-duplicate them, as we register a lot of them together with always the same values except data. So we are going to have a way to register a static array of callbacks for an Object, and that alone should reduce our memory usage dramatically for callbacks.
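Here is roughly what I mean by a static array of callbacks. Nothing in this sketch exists in Evas at this point: every name is hypothetical and the final API may look quite different. The point is only that the table of (event, function) pairs is a single constant shared by all objects, while each object stores one pointer to it plus its data.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of the planned idea; not an existing Evas API. */
typedef void (*Demo_Event_Cb)(void *data, int event);
typedef struct { int event; Demo_Event_Cb func; } Demo_Cb_Desc;

/* Per-object cost: one pointer to a shared table, a count and the data. */
typedef struct {
    const Demo_Cb_Desc *table;
    size_t count;
    void *data;
} Demo_Cb_Array;

enum { DEMO_EV_MOUSE_IN, DEMO_EV_MOUSE_OUT, DEMO_EV_RESIZE };

/* Example callback: counts how many times it was called. */
void demo_count_cb(void *data, int event)
{
    (void)event;
    ++*(int *)data;
}

/* One constant table for every widget of this kind. */
static const Demo_Cb_Desc demo_widget_cbs[] = {
    { DEMO_EV_MOUSE_IN,  demo_count_cb },
    { DEMO_EV_MOUSE_OUT, demo_count_cb },
    { DEMO_EV_RESIZE,    demo_count_cb },
};

/* Register the whole table for one object in a single step. */
Demo_Cb_Array demo_widget_callbacks(void *data)
{
    Demo_Cb_Array a = {
        demo_widget_cbs,
        sizeof(demo_widget_cbs) / sizeof(demo_widget_cbs[0]),
        data
    };
    return a;
}

/* Call every callback registered for this event; return how many ran. */
int demo_emit(const Demo_Cb_Array *a, int event)
{
    int called = 0;
    size_t i;

    assert(a && a->table);
    for (i = 0; i < a->count; i++)
        if (a->table[i].event == event) {
            a->table[i].func(a->data, event);
            called++;
        }
    return called;
}
```

With three callbacks per widget, the per-object cost drops from three separate entries to a single `Demo_Cb_Array`, and it stays constant no matter how many callbacks the table holds.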
Now that you have read this far, you are probably wondering why we care so much about memory. Why do we think that it is not OK to add features and pay for them with a bigger memory footprint? Why does it really matter when we have multiple GB of memory available? Why spend time on such useless optimization? No. Seriously. Why?
Well, the answer is simple: speed and power consumption. Most of our tasks are memory bound. Using less memory gives us more room for doing the actual rendering. The CPU is actually so insanely fast that one core is almost able to fill all memory bandwidth for most rendering operations. By using less memory, we hope to use less memory bandwidth for things that don't really matter and then have more bandwidth available for things that do. So before we start using S2TC in our software engine, using less data for everything else is clearly a good move.
As for power consumption, thanks to my work at Samsung, I now know that using memory is much more costly on battery than using the CPU cache. In fact, every level of CPU cache uses more power than the previous one, so the more you stay in L1, the better. Of course this directly affects performance as well, so measuring performance is a simple way of measuring potential power consumption. Of course L1 is too small to put everything inside, but you get the idea. Being smaller means less battery usage. Also, on a mobile phone, the bigger the main memory, the more battery it uses, even if you don't access it. So if your system uses less memory, you can ship it with less memory (yeah, the marketing department is going to hate us, they can't play the game of bigger numbers...) and have more battery life with the same kind of applications.
That's why Samsung cares about memory consumption and optimization.