
eina blist data structure proposal
Needs Review (Public)

Authored by raster on Dec 17 2018, 9:17 AM.
This revision needs review, but there are no reviewers specified.

Details

Reviewers
None
Summary

Eina Arrays are compact memory structures, but they are bad at modifications: append, insert and remove are all costly. Over-allocating at the end can alleviate some of the append cost, but none of the others.

Eina Lists are doubly-linked lists and are great for append, prepend, insert and remove, but are very costly in terms of memory footprint and costly to walk due to cache inefficiency.

This is intended to be able to replace many uses of Eina List (not Inlist) inside EFL and elsewhere, giving faster list operations and a smaller memory footprint. It will take work to do the replacements, and in some cases it is hard because external APIs rely on an Eina List being returned or ingested by the API.

This includes test code that runs a set of operations against an Eina List and a BList (from walking to modifications of various types). My benchmarking shows that BList is roughly 5x-10x faster in this test (depending on the number of iterations and the list length) vs. Eina List, and uses about 20% of the memory. Those are fairly impressive numbers for a first stab at this, so it's worth looking into this more.

This is a proposal for something that strikes a balance between the two. It's a "Block List": a doubly-linked list of blocks, each holding an array of data pointers. Insert, remove, append and prepend are relatively cheap, and walking is far more efficient too. A block size of 128 bytes was chosen because that is a single cache line on some architectures and 2 sequential cache lines on almost all others. Each block contains pointers to the next/previous blocks (with the element count squeezed into the lower 7 bits, which the 128-byte alignment leaves free). The rest of the block is an array of data pointers.

When an item is removed or inserted, all items in the array after that position are shuffled down/up. If the block is full, a new block may be created after it. If a block's population count drops to 0, it is freed and the neighboring blocks in the list are re-linked. Precisely when frees happen or not is an implementation detail, though.

The API looks different from both the array and list APIs in Eina, as it tries to combine properties of each. The BList is auto-allocated when content is first added and de-allocated when the last item is removed, so a NULL BList handle is an empty BList. You don't have to worry about allocating/deallocating the BList separately from inserting or removing items.

It only accepts pointers as content, because accepting entire data structures is probably not worth it. I have not implemented an Eina Iterator or Accessor, as that is not an absolute requirement for discussing the API and general design at this point. It doesn't have full documentation yet for the same reason: this is more about discussing design, ideas and direction.

I have been careful to design the API to hide the implementation (unlike Eina List), specifically so that the implementation can change: it can be optimized over time, or BList could even transform how it stores things at runtime based on usage patterns. It could flatten a BList that is almost never modified into one big array, or choose different block sizes. I think it's important to keep this opaque and hide such details for future speedups.

This has only one downside vs. Eina List: you cannot keep an "Eina_List *"-style pointer handle to a list node around. With Eina List, such handles can be stored inside other structures for instant O(1) access to the list node, e.g. to de-link an object without hunting for it. Indexes can't work either, as an index can change with modifications to the BList (items before an indexed item can be removed or inserted). The API does support a form of handle (Eina_Blist_Item *), but it is intended as a temporary handle only, valid until the BList is modified; once modified, the handle is invalid. This leads to eina_blist_clean() and eina_blist_clean_force() as explicit "clean up this list please" calls (the first may do nothing if it doesn't think cleanup is worth it; the second forces a cleanup/defrag etc. even if it's probably not worth it), so modifications to the list have to be explicit, and reading can't modify it, because you need to know when an item handle may be invalidated.

In addition, the list supports a free function (like Eina Hash does) that is called on all items still in the BList when it is freed, or on items as they are removed, as a convenience so you don't have to remember to do it yourself.

This patch review is a bit of an RFC. Adding it to Eina just adds extra code (until it's in use), but it can't be a stable/public API until it's "acceptable", so the API design, and anything it implies or that affects it, is really key here.

This is just some code to review and examine. I've uploaded the raw files here:

F3520530
F3520531
F3520532

To compile and run:

gcc blist.c eina_blist.c -o blist `pkg-config --cflags --libs eina`
time ./blist blist 10000 1000
time ./blist list 10000 1000

You will probably find the list test is somewhere between roughly 5x and 12x slower than the blist one, e.g.:

./blist blist 10 1000000    1.76s user 0.00s system 99% cpu    1.761 total
./blist list  10 1000000    7.85s user 0.00s system 99% cpu    7.854 total (4.5x)
./blist blist 100 100000    1.70s user 0.00s system 99% cpu    1.707 total
./blist list  100 100000    7.95s user 0.00s system 99% cpu    7.957 total (4.7x)
./blist blist 1000 10000    1.76s user 0.00s system 99% cpu    1.765 total
./blist list  1000 10000    9.95s user 0.00s system 99% cpu    9.959 total (5.7x)
./blist blist 10000 1000    3.96s user 0.26s system 99% cpu    4.225 total
./blist list  10000 1000   49.58s user 0.00s system 99% cpu   49.587 total (11.7x)
./blist blist 100000  10    3.70s user 0.03s system 99% cpu    3.728 total
./blist list  100000  10   43.89s user 0.00s system 99% cpu   43.890 total (11.7x)
./blist blist 1000000  1   43.20s user 0.02s system 99% cpu   43.227 total
./blist list  1000000  1  460.35s user 0.04s system 99% cpu 7:40.430 total (10.7x)

Note that the speed-ups get bigger the longer the list, but are roughly in the 5-10x range. Heap usage is also improved, by about 5x.

If your system is very slow, remove a 0 from the 2nd parameter (the number of times it loops over the same test).

So thoughts, improvements? Design changes? Reasoning behind them? This isn't worth it?

raster requested review of this revision.Dec 17 2018, 9:17 AM
raster created this revision.

Interesting experiment.

I'm not really able to duplicate your results for the smaller lists though, using an i7 8600k here, with both EFL and the demo compiled with -O99 (since optimizations really make a difference for this sort of thing). Run times are much closer together for me than for you; list actually beats blist until the list length is significant, like well over 1000.

Further, if we ignore the relative append/prepend nonsense and the searching delete operation, good old eina_list wins by a landslide. So it feels like this is kind of optimizing the cases a smart dev wouldn't be using a linked list for in the first place?

Definitely an obvious memory savings, but not sure it's worth the complexity? Do we really have linked lists long enough that the pointer overhead is a real problem? At that point perhaps we should consider a singly-linked list? However, we already have something like 3 linked-list APIs present; we'd need a very compelling reason for adding another. This has me thinking about trying to add some instrumentation to lists to see what our average length is for a few common apps.

I'm a little nervous about microbenchmarking this and deciding it's great; all those shuffle ops look like they could have cache ramifications for calling code, and that won't really show up here?
Also, is it safe to assume malloc/calloc make sure we're 128-byte aligned? Or will performance magically fall off a cliff when something else outside the microbench does a malloc and we end up only 64-byte aligned or something? Don't we already have another mempool abstraction in place this could've relied on instead of rolling its own? Sometimes disabling eina_mempool is useful for debugging purposes; having that same mechanism disable the pooling here would probably make valgrind results more intelligible.

Maybe we should be looking through our internals and finding places that use eina_list when eina_inlist, or as hash table, or some other data type would be more performant, though.

I'm not really a fan of adding a fourth list API, especially if it primarily just optimizes things linked lists aren't the preferred data type for...

Admittedly my numbers were on aarch64. But don't misunderstand list length as being the real factor: cache hits are. BList is intended very specifically to reduce the downside of cache misses; it's reliant on that. So lists that don't stay in cache much are going to be far worse on Eina List vs. BList. The length in this test is really a measure of "how much of the list will be in L1/L2/L3 cache": the bigger the number of items, the less will be in the caches (or less in L1 vs L2, less in L2 vs L3, less in L3 vs RAM). I maybe could have put in some code to pollute caches regularly between ops, plus a "how often to pollute caches and by how much" knob, as part of the benchmark. I could adapt the test to do this explicitly; I was relying on the very simple factor of list size determining cache hits vs. misses.

There are lots of lists, e.g. in Evas, that will be accessed (walked) a lot during render and then only rarely modified here and there while objects are modified (clippees for example). So when a clip is set or unset, or a clipped object is deleted, these lists are modified (some ptr is hunted down by a walk and then removed). When this happens the chances of the list nodes being in cache are very, very low IMHO, as app code has polluted at least all of L1 (and EFL too for sure), and maybe some of L2 and even L3. Rendering will almost always pollute all of the caches and wipe everything out, so these lists will almost certainly be pushed out of cache at least once per frame, if not more often.

I used posix_memalign() to ensure alignment, so keep in mind that this isn't using eina_mempool and maybe takes a perf hit there vs. Eina List. I just kept it simple for now; it's not as optimal as it could be. It's more about discussing the design/idea of having linked blocks of arrays: how often do we merge? What should a block size be?

You are right that lists that live in L1 cache (like temporary lists that are smallish) would have zero benefit here. This data struct is not intended for those cases.

We could use singly-linked lists, but these are still heavy. We have an accounting ptr per node in our doubly-linked lists (because every node is the same) to keep track of, for example, the list count (to make counting O(1)) as well as the last element (for O(1) appends). Singly-linked lists would still use almost 2x the memory of BList, it would seem, for reasonably sized lists. BList is actually abstracted to the point where it COULD use a singly-linked list internally until it reaches size N, then switch. Part of the API design is to hide that detail: BList could flatten into one big array too, or vary block sizes (start with a small 64-byte block, then expand to 128, then 256, etc.). A singly-linked list (on 64-bit) will still cost 16 bytes per node anyway, so at 4 items they even out for the first time, and at 8 items the BList will always be smaller.

You are right that the shuffling of items is a cost, and if you do a LOT of list manipulation it could be costly. But that is only true if the list nodes are always in cache. At some point, when a list lives in L2, L3 or RAM, the cost of shuffling is, on average, cheaper than the memory waits that extra cache misses create. That point will probably vary wildly per list and usage pattern AND per architecture and chip setup (caches, their speeds, sizes and how many, memory bus latency vs. CPU speed, etc.). This is precisely why we need a list-like data structure that hides these details, so they can even change at runtime based on these factors. I think this is a very good discussion to have: what internal strategies should exist, and at which points should they be switched?

I have at least done the memory profiling, and Eina_List is one of the fairly big memory users in EFL. It's been on my TODO list to try to address this for a long time. Just try it for yourself. I ran E in "wl in x11 window" mode here, ran it for a bit, opened some windows, efm, the applications menu... and of the 31M of memory massif reports we use, 848k + 80k (928k) is just Eina List nodes. That's a lot from JUST there; of our mempool usage (total mempool use is 4.1M) it's a large chunk, and that's what attracts my attention. Then trying elementary_test - open up the genlist test, buttons, checks, bubble window, scroll around a bit, picking one of the massif samples along the way while everything was up (not a peak at start or something) - I see a 20.7M footprint, of which 5.2M comes from mempools and 1.1M + 132k + 48k (1306k) just from Eina List memory usage. So Eina List is definitely showing up at something like 3-6% of our entire memory footprint, and that footprint includes image data, object structs (which are far larger), etc. To me this seems a worthy thing to attack in some way.

I indeed don't have numbers on list sizes, access patterns, etc. What I would really need to make this useful is some idea of cache hits and misses for list accesses. If I could get this AND a log out of e.g. a run of E or elementary_test, I could even replay the log with a new list structure and see. It'd be a lot more work than what I've done so far, so I'm going by gut instinct, experience, a history of profiling a lot of things in EFL, and general knowledge of how the code flows. Reproducing this exactly in an isolated test case will be a fair bit more work. Is there a way to know if a ptr is in or out of cache (in L1, L2, L3 or RAM) and its rough access cost, other than actually timing it? Timing it, I think, will be really quantum-mechanics-like, in that measuring the clock with gettimeofday will totally affect/hurt the profile while trying to gather it. Any pointers? There is cachegrind, but I wonder if it's really useful for just this, as I'd like to see the kind of latency profile for accessing list nodes over time - like each node having a history, unsigned short latencies[] = { 1, 1, 12, 1, 80, 1, 1, 150, 80, 1, 1, 12, 154 }; - some log of them or something. You get the idea.

raster added a comment.EditedDec 19 2018, 7:10 AM

I did some benchmarking across many machines, generations and architectures, scripted as part of a build script. All binaries were compiled with -O3 -march=native (except on aarch64, where -march=native is not valid). The results are consistent: BList always wins, and the wins get much bigger the longer the list and the less of it fits into cache. These are run times, so shorter is better. The 32-bit ARM systems and the Baytrail Atom system did 1/10th the number of loops so they'd complete in a reasonable time.

*ARM: Cavium ThunderX2 - ThunderX2 (ARM64 256x @ 2.5Ghz)*

./blist blist 10 100000$SPEED  1.59s user 0.00s system 99% cpu  1.586 total
./blist blist 100 10000$SPEED  1.53s user 0.00s system 99% cpu  1.535 total
./blist blist 1000 1000$SPEED  1.59s user 0.00s system 99% cpu  1.599 total
./blist blist 10000 100$SPEED  3.88s user 0.17s system 99% cpu  4.057 total
./blist blist 100000  1$SPEED  3.68s user 0.03s system 99% cpu  3.708 total
==============
./blist list 10 100000$SPEED   7.85s user 0.00s system 99% cpu  7.849 total
./blist list 100 10000$SPEED   7.93s user 0.01s system 99% cpu  7.941 total
./blist list 1000 1000$SPEED   9.90s user 0.00s system 99% cpu  9.928 total
./blist list 10000 100$SPEED  49.23s user 0.00s system 99% cpu 49.241 total
./blist list 100000  1$SPEED  43.89s user 0.00s system 99% cpu 43.900 total

*ARM: ODROID XU3 - Exynos 5422 (ARM, 4xA15 @ 2Ghz, 4xA7 @ 1.4Ghz)*

./blist blist 10 100000$SPEED  0.50s user 0.00s system 99% cpu  0.506 total
./blist blist 100 10000$SPEED  0.39s user 0.00s system 99% cpu  0.401 total
./blist blist 1000 1000$SPEED  0.40s user 0.01s system 99% cpu  0.408 total
./blist blist 10000 100$SPEED  0.84s user 0.01s system 99% cpu  0.855 total
./blist blist 100000  1$SPEED  0.53s user 0.01s system 99% cpu  0.544 total
==============
./blist list 10 100000$SPEED   2.95s user 0.01s system 99% cpu  2.976 total
./blist list 100 10000$SPEED   2.53s user 0.01s system 99% cpu  2.557 total
./blist list 1000 1000$SPEED   2.92s user 0.01s system 99% cpu  2.948 total
./blist list 10000 100$SPEED  10.22s user 0.01s system 99% cpu 10.286 total
./blist list 100000  1$SPEED   8.63s user 0.01s system 99% cpu  8.703 total

*ARM: Raspberry Pi3 - Broadcom BCM2837B0 (4xA53 @ 1.3Ghz)*

./blist blist 10 100000$SPEED  0.72s user 0.00s system 99% cpu  0.722 total
./blist blist 100 10000$SPEED  0.66s user 0.00s system 99% cpu  0.663 total
./blist blist 1000 1000$SPEED  0.70s user 0.01s system 99% cpu  0.711 total
./blist blist 10000 100$SPEED  1.50s user 0.02s system 99% cpu  1.521 total
./blist blist 100000  1$SPEED  1.21s user 0.00s system 99% cpu  1.215 total
==============
./blist list 10 100000$SPEED   1.66s user 0.01s system 99% cpu  1.668 total
./blist list 100 10000$SPEED   1.58s user 0.01s system 99% cpu  1.594 total
./blist list 1000 1000$SPEED   1.71s user 0.00s system 99% cpu  1.716 total
./blist list 10000 100$SPEED   4.70s user 0.01s system 99% cpu  4.714 total
./blist list 100000  1$SPEED   4.05s user 0.01s system 99% cpu  4.067 total

*x86: NUC - Intel Atom Baytrail - E3815 @ (1x1.46Ghz)*

./blist blist 10 100000$SPEED  0.60s user 0.01s system 99% cpu  0.617 total
./blist blist 100 10000$SPEED  0.51s user 0.01s system 99% cpu  0.524 total
./blist blist 1000 1000$SPEED  0.50s user 0.02s system 99% cpu  0.522 total
./blist blist 10000 100$SPEED  0.92s user 0.01s system 99% cpu  0.933 total
./blist blist 100000  1$SPEED  1.37s user 0.02s system 99% cpu  1.402 total
==============
./blist list 10 100000$SPEED   2.52s user 0.01s system 99% cpu  2.548 total
./blist list 100 10000$SPEED   2.63s user 0.02s system 99% cpu  2.665 total
./blist list 1000 1000$SPEED   3.68s user 0.02s system 99% cpu  3.720 total
./blist list 10000 100$SPEED  24.75s user 0.03s system 99% cpu 24.932 total
./blist list 100000  1$SPEED  22.77s user 0.03s system 99% cpu 22.971 total

*x86: Desktop - Intel i7-3770 (8x @ 3.4Ghz)*

./blist blist 10 100000$SPEED  1.27s user 0.00s system 99% cpu  1.273 total
./blist blist 100 10000$SPEED  1.22s user 0.00s system 99% cpu  1.225 total
./blist blist 1000 1000$SPEED  1.28s user 0.00s system 99% cpu  1.284 total
./blist blist 10000 100$SPEED  2.35s user 0.00s system 99% cpu  2.348 total
./blist blist 100000  1$SPEED  1.42s user 0.00s system 99% cpu  1.415 total
==============
./blist list 10 100000$SPEED   5.04s user 0.00s system 99% cpu  5.042 total
./blist list 100 10000$SPEED   5.89s user 0.00s system 99% cpu  5.886 total
./blist list 1000 1000$SPEED   7.27s user 0.00s system 99% cpu  7.270 total
./blist list 10000 100$SPEED  33.53s user 0.00s system 99% cpu 33.527 total
./blist list 100000  1$SPEED  29.47s user 0.00s system 99% cpu 29.478 total

*x86: Desktop - Intel i7-7700K (8x @4.2Ghz)*

./blist blist 10 100000$SPEED  0.93s user 0.00s system 99% cpu  0.928 total
./blist blist 100 10000$SPEED  0.86s user 0.00s system 99% cpu  0.861 total
./blist blist 1000 1000$SPEED  0.88s user 0.00s system 99% cpu  0.878 total
./blist blist 10000 100$SPEED  1.65s user 0.00s system 99% cpu  1.654 total
./blist blist 100000  1$SPEED  1.09s user 0.00s system 99% cpu  1.093 total
==============
./blist list 10 100000$SPEED   1.20s user 0.00s system 99% cpu  1.202 total
./blist list 100 10000$SPEED   1.24s user 0.00s system 99% cpu  1.242 total
./blist list 1000 1000$SPEED   1.62s user 0.00s system 99% cpu  1.624 total
./blist list 10000 100$SPEED  11.78s user 0.00s system 99% cpu 11.781 total
./blist list 100000  1$SPEED  11.97s user 0.00s system 99% cpu 11.983 total

*x86: Dell XPS13 - Intel i7-8550U (8x @1.8Ghz)*

./blist blist 10 100000$SPEED  1.11s user 0.00s system 99% cpu  1.110 total
./blist blist 100 10000$SPEED  0.99s user 0.00s system 99% cpu  0.991 total
./blist blist 1000 1000$SPEED  0.98s user 0.00s system 99% cpu  0.981 total
./blist blist 10000 100$SPEED  1.78s user 0.00s system 99% cpu  1.778 total
./blist blist 100000  1$SPEED  1.25s user 0.00s system 99% cpu  1.250 total
==============
./blist list 10 100000$SPEED   1.39s user 0.00s system 99% cpu  1.395 total
./blist list 100 10000$SPEED   1.41s user 0.00s system 99% cpu  1.418 total
./blist list 1000 1000$SPEED   1.82s user 0.00s system 99% cpu  1.817 total
./blist list 10000 100$SPEED  13.30s user 0.00s system 99% cpu 13.309 total
./blist list 100000  1$SPEED  13.46s user 0.00s system 99% cpu 13.474 total

So your results @ManMower seem to totally clash with mine on a whole host of systems and architectures. :) I don't know why.

vtorri added a subscriber: vtorri.Dec 19 2018, 11:41 AM

On Windows:

E:\Documents\MSYS2\tmp\ccgvVKzX.o:eina_blist.c:(.text+0x16e): undefined reference to `posix_memalign'

Ok, I've gone back to your original code to make sure I'm testing the same thing you are, and I've rebuilt my EFL and this benchmark with -O3 -march=native
i7-8600K:

./blist blist 10 100000     0.084s
./blist blist 100 10000     0.076s
./blist blist 1000 1000     0.076s
./blist blist 10000 100     0.150s
./blist blist 100000 1      0.111s
==============
./blist list 10 100000      0.077s
./blist list 100 10000      0.075s
./blist list 1000 1000      0.097s
./blist list 10000 100      1.053s
./blist list 100000 1       1.172s

So for me, somewhere between the 3rd and 4th test is where the change becomes "significant".

@vtorri a stackoverflow commenter claims:
#define posix_memalign(p, a, s) (((*(p)) = _aligned_malloc((s), (a))), *(p) ? 0 : errno)

should be equivalent for windows, but I have no idea if that's really the case!

Yes, I've also seen this on SO. I've changed the blist code directly...
Now I'm recompiling EFL 1.21 with -O3 -march=native...

Some questions..

I hadn't immediately noticed posix_memalign - indeed that invalidates concern about losing alignment. However, will it bloat memory consumption during normal workloads, since other code might be doing allocations between blist chunks? Also, I'm now wondering if blist would consume significantly less memory if the alignment left some pad for malloc bookkeeping...

How is it possible that the rpi3 is beating the i7s? I realize the pointer sizes are different, but that still seems improbable? Your golden sample PI3 is also beating the odroid in the eina_list test. I think that alone needs further investigation? :)

I tried eina_list with mempool=pass_through and it actually improved eina_list performance in this benchmark. This surprised me, should it have?

Should we really be including eina_list_remove() and the relative ops in the benchmark? I think they're worst-case scenarios for eina_list ops, as they do a list search - normally an item would be removed with eina_list_remove_list(), which shouldn't incur a walk. If I remove them from both the blist and list tests I get results like this:

./blist blist 10 100000    0.086s
./blist blist 100 10000    0.082s
./blist blist 1000 1000    0.079s
./blist blist 10000 100    0.081s
./blist blist 100000 1     0.011s
==============
./blist list 10 100000     0.063s
./blist list 100 10000     0.059s
./blist list 1000 1000     0.065s
./blist list 10000 100     0.112s
./blist list 100000 1      0.015s

Which makes this a lot less compelling...

I'll give the code a deeper read in the near future.

Full disclosure: I want to write it off just because we already have clist, list, and inlist. But I'll try my best to give it a fair shake.

Since, you know, old hardware is still great (which is why I use Linux and Enlightenment in the first place):

On an about-10-year-old E5800 @ 3.20GHz with 6GB RAM, linked against EFL 1.20.7 on openSUSE, built with -O3 -march=native:

/usr/bin/time --format="%C %U user %S system %P cpu %e total"

./blist blist 10 1000000   2.12 user 0.00 system 98% cpu 2.15 total
./blist list  10 1000000   4.02 user 0.00 system 99% cpu 4.03 total
./blist blist 100 100000   1.68 user 0.00 system 99% cpu 1.68 total
./blist list  100 100000   4.00 user 0.00 system 99% cpu 4.00 total
./blist blist 1000 10000   1.74 user 0.00 system 99% cpu 1.74 total
./blist list  1000 10000   4.26 user 0.00 system 99% cpu 4.29 total
./blist blist 10000 1000   3.40 user 0.00 system 99% cpu 3.41 total
./blist list  10000 1000  13.04 user 0.10 system 98% cpu 13.31 total
./blist blist 100000  10   9.71 user 0.05 system 99% cpu 9.84 total
./blist list  100000  10  27.06 user 0.42 system 98% cpu 27.88 total
./blist blist 1000000  1 149.78 user 0.50 system 98% cpu 153.04 total
./blist list  1000000  1 274.26 user 0.69 system 97% cpu 280.71 total

Raspberry Pi Zero, ARMv6-compatible processor rev 7 (v6l), Hardware: BCM2835, linked against EFL 1.21.1 on Arch, built with -O3 -march=native:

/usr/bin/time --format="%C %U user %S system %P cpu %e total"

./blist blist 10 1000000 15.25 user 0.00 system 99% cpu 15.31 total
./blist list  10 1000000 69.74 user 0.01 system 99% cpu 70.16 total
./blist blist 100 100000 13.32 user 0.02 system 99% cpu 13.39 total
./blist list  100 100000 77.14 user 0.00 system 99% cpu 77.60 total
./blist blist 1000 10000 14.94 user 0.01 system 99% cpu 15.01 total
./blist list  1000 10000 91.06 user 0.00 system 99% cpu 91.59 total
ProhtMeyhet added a comment.EditedDec 19 2018, 1:30 PM

How is it possible that the rpi3 is beating the i7s? I realize the pointer sizes are different, but that still seems improbable? Your golden sample PI3 is also beating the odroid in the eina_list test. I think that alone needs further investigation? :)

At first I thought it was probably just different kernel versions; then I remembered Spectre, Meltdown and Foreshadow and the mitigations the kernel has to do for each, which have changed a lot between recent kernel versions. So maybe these comparisons are meaningless without considering kernel versions...

@ManMower - it seems your machine is an outlier so far, but I guess the principle still counts. I should modify the benchmark to pollute the cache at times, but then the benchmark needs to do its own timings, as it has to ignore the pollution bits, so it's going to become a fair bit more complex than it is.

However, will it bloat memory consumption during normal workloads, since other code might be doing allocations between blist chunks? Also, I'm now wondering if blist would consume significantly less memory if the alignment left some pad for malloc bookkeeping...

Your concern about memory bloat is correct, but this is the case with Eina List too. It's no different, other than Eina List allocating many, many more chunks for the same number of items in a list: one per item. BList allocates 1 chunk per 14 items on 64-bit (per 30 items on 32-bit), so there are many fewer allocations, frees and chunks. I haven't used a mempool here because this is meant to be a discussion of design and ideas, not a highly tuned and optimized implementation. I am not sure a mempool is even worth it, but it would trade malloc bookkeeping overhead against mempool over-allocation and fragmentation. This can happily use mempools like Eina List does internally; it's a bit of extra code to init it and replace the allocating/freeing code. I don't ultimately think this should use malloc/calloc/posix_memalign - this is a start, to look at the broader strokes.

How is it possible that the rpi3 is beating the i7s?

As for the rpi3 beating the i7: I dropped a 0 from the runs. That $SPEED is just an extra digit in my script, so the faster machines did 10x more runs of the same benchmark, while on the slower ones I dropped it so they'd complete in a sane amount of time. That's what the 2nd number is: the # of times it re-does the same benchmark run. I did run both the XU3 and the rpi3 with 1/10th the runs (I did explain this in the blurb above the benchmarks... I also ran the Atom with 1/10th the runs). The absolute numbers don't matter across systems; it's blist vs. list as relative performance on the same system that matters. Don't get distracted by these. :)

Should we really be including eina_list_remove() and the relative ops in the benchmark?

As for remove - yeah, we should include it. For example, the clippees list in Evas objects regularly has things removed. Just counting occurrences across EFL and its tools: eina_list_remove() appears 457 times, eina_list_append() 1742, eina_list_prepend() 198, eina_list_append_relative() 53, eina_list_prepend_relative() 58 (I'm skipping any usage inside eina). So eina_list_remove() is pretty much the 2nd most commonly used Eina List modification API, at least by source code scan (I haven't counted runtime usage, as that'll take longer to get numbers for, but my gut says it's an important and common call there too). It's costly because it combines a walk (which is not cache friendly) with the removal work, but that's how the cookie crumbles. It's only fair that it be tested well. :)

I tried eina_list with mempool=pass_through and it actually improved eina_list performance in this benchmark.

That's possible. In these cases the ONLY allocations are for lists, so it may be that malloc does a better job here, as it can actually use a single pool of same-sized objects internally. My understanding of glibc's malloc implementation is that it does do this kind of thing: it creates pools of same-sized objects for the smaller allocation sizes to reduce overhead, etc.

Which make this a lot less compelling...

As above - I think it's an important call. Look how often it is used... :) My gut tells me I'm in the right ballpark at least. A full profile of runtime usage would give better stats, but it'd introduce the need to replay a log, where the cost of reading the log might become significant vs. the list operations themselves. Result usefulness/accuracy vs. time invested is, I think, probably not worth it. Maybe just getting some percentages of which calls are used most/least would help me tune the benchmark. I'll have to throw in some self-timing then too, to exclude the cache pollution I'll have to do.

@ProhtMeyhet you seem to have slightly older hardware than me, but that indeed shows the same trend as my benchmarks. The trade-off of cache friendliness vs. some extra work shuffling small arrays around seems to pay off - well, in all cases except @ManMower's. :) Thanks for the extra numbers.

I updated my benchmarks for the Intel Atom - the system wasn't completely idle. Fixed that now; it was idle for the updated numbers. It still tells the same relative story of blist vs. list.

vtorri added a comment.EditedDec 20 2018, 4:00 AM

On Windows (i7 6700K):

./blist blist 10 100000    0m0,111s
./blist blist 100 10000    0m0,111s
./blist blist 1000 1000    0m0,115s
./blist blist 10000 100    0m0,178s
./blist blist 100000 10    0m1,122s

./blist list 10 100000    0m0,150s
./blist list 100 10000    0m0,157s
./blist list 1000 1000    0m0,183s
./blist list 10000 100    0m1,289s
./blist list 100000 10    0m12,581s

@ManMower - it seems your machine is an outlier so far, but I guess the principle still counts. I should modify the benchmark to pollute the cache at times, but then the benchmark needs to do its own timings, as it has to ignore the pollution bits, so it's going to become a fair bit more complex than it is.

Surprised to continue to be an outlier, I wonder if gcc version matters.

However, will it bloat memory consumption during normal workloads, since other code might be doing allocations between blist chunks? Also, I'm now wondering if blist would consume significantly less memory if the alignment left some pad for malloc bookkeeping...

Your concern about memory bloat is correct, but this is the case with Eina List too. It's no different, other than that Eina List will allocate many, many more chunks for the same number of items in a list - one per item. BList allocates 1 chunk per 14 items on 64-bit (per 30 items on 32-bit), so many fewer allocations, frees and chunks. I haven't used a mempool here because this is meant to be a discussion on design and ideas, not a highly tuned and optimized implementation. I am not sure a mempool is even worth it, but it would trade malloc bookkeeping overhead against mempool over-allocation and fragmentation. This can happily use mempools like Eina List does internally; it's a bit of extra code to init it and replace the allocating/freeing code. I don't ultimately think this should use malloc/calloc/posix_memalign - this is a start to look at the broader strokes.

Not exactly my point though...

You ask for 128 bytes with 128 byte alignment.

You get that, but there's some malloc/memalign bookkeeping in there, so you've ACTUALLY allocated some implementation-defined amount - let's say 136 bytes.
If yours is the next allocation request, you'll need to throw away an entire 120 bytes to get your alignment back. If you'd left "enough" <shrug> space in there for the bookkeeping, you'd throw away those 8 extra bytes or whatever, but your next allocation wouldn't need to waste 120 to get aligned.

Will malloc be able to piece out those 120 byte areas we skipped for alignment?

I think that might potentially be an area for further testing - it also doesn't show well in the microbenchmark because those areas are pure waste since there are no smaller allocations to test.

I guess this is just another reason that "polluting" the microbench to make it more real-world (as you've suggested - though I'm curious about more/different "pollution" than you had in mind :)) is a very good idea. Among other things, you'd see if small mallocs fill the voids and make the blist waste less prominent.

Should we really be including eina_list_remove() and the relative ops in the benchmark?

As for remove - yeah, we should include it. For example, the clippees list in evas objects regularly has things removed. Just counting occurrences across EFL and its tools: eina_list_remove() I see 457 times, eina_list_append() has 1742, eina_list_prepend() 198, eina_list_append_relative() 53, eina_list_prepend_relative() 58. I'm skipping any usage inside eina. So eina_list_remove() is pretty much the 2nd most commonly used eina_list modification API, at least by source code scan (I haven't counted runtime usage, as that'll take longer to get numbers, but my gut says it's also an important and common call there too). It's costly as it combines a walk (which is not cache friendly) plus the removal work, but that's how the cookie crumbles. It's only fair that it be tested well. :)
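A source scan like the one behind those counts could be recreated with something along these lines, run from the top of an EFL checkout. The paths and grep flags here are assumptions for illustration, not the exact commands used; matching `name(` keeps `eina_list_append(` from also counting `eina_list_append_relative(`.

```shell
# count call sites per eina_list modification API, skipping eina itself
for fn in eina_list_remove eina_list_append eina_list_prepend \
          eina_list_append_relative eina_list_prepend_relative; do
  count=$(grep -r --include='*.[ch]' -o "${fn}(" src \
            | grep -v '/lib/eina/' | wc -l)
  printf '%-28s %s\n' "${fn}()" "$count"
done
```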

Are there places where additional cleverness can be applied to remove removes? Any place we can remove an eina_list_remove() and replace it with eina_list_remove_list() is potentially a win, made more important the longer the list gets...

So I'm wondering if there are places where we can harmlessly replace eina_list_append() with eina_list_prepend() (is that everywhere? we could just walk the list in reverse on usage...) - the return value of the latter being something we can use for eina_list_remove_list() later, so we could store it... I think I'll take a look around for such cases, I suspect there are a few. So we'd throw away a pointer's size worth of memory to replace a list walk with an O(1) remove...

I think blist can't easily have eina_list_remove_list()-type functionality - an O(1) remove is... very difficult. Once a shuffle happens, anything you'd have hung on to would be invalidated (I think you'd already mentioned this).

So if we carelessly replace things with blist, we could actually lose the opportunity to reduce algorithmic complexity (N -> 1)...

Which makes this a lot less compelling...

As above - I think it's an important call. Look how often it is used... :) My gut tells me I'm in the right ballpark at least. A full profile of runtime usage would give better stats, but it'd introduce the need to replay a log, where the cost of reading the log might become significant vs. the list work itself. Result usefulness/accuracy vs. time invested, I think, is probably not worth it. I think maybe just getting some percentages of which calls are used most/least might help me tune the benchmark. I'll have to throw in some self-timing then too, to remove the cache pollution I'll have to do.

Hehe, I'd love to see that - but once I start working out in my head how long it'd take me to implement such a thing, it gets pretty daunting... But yes, I think that would give very meaningful measurements.