
Performance regression because of patch (eet - dictionary - use rwlocks instead of spinlocks)
Closed, Resolved · Public

Description

Hi Raster,

Your commit c0d6cbab97a1fac072aa1a7101165d9db25903a7 (eet - dictionary - use rwlocks instead of spinlocks - better contention) has caused a performance regression on both Linux desktop (Ubuntu) and Tizen. eet_data_read() takes twice as long because of the change from spinlocks to rwlocks.

I have some VTune profiling data I can share with you if you need it. But a simple test is to wrap the call with ecore_time_get() and see how much time that function takes.
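
Something like the following is what I mean by the simple test. The file handle, key, and descriptor here are just placeholders; only ecore_time_get() and eet_data_read() are the real calls, and it assumes eet_init()/ecore_init() were already done:

#include <Eet.h>
#include <Ecore.h>
#include <stdio.h>
#include <stdlib.h>

/* wrap eet_data_read() with ecore_time_get() - "edd" is assumed to describe
 * whatever structure is stored under "key" in the already-open eet file */
static void
time_eet_data_read(Eet_File *ef, Eet_Data_Descriptor *edd, const char *key)
{
   double t0 = ecore_time_get();
   void *data = eet_data_read(ef, edd, key);
   double t1 = ecore_time_get();
   printf("eet_data_read(%s): %.3f ms\n", key, (t1 - t0) * 1000.0);
   free(data);
}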

Is there any case where we read from the eet file from multiple threads simultaneously (in EFL)? Since eet_data_read() calls the eet_dictionary functions many times (depending on how big a data structure we are reading), the number of locks and unlocks is really huge and takes 60-70% of the total read time. Instead, if we take the dictionary lock once at the start, do the read, and then release the lock, that will improve the performance a lot (of course it makes eet_data_read() sequential, but at least the read time is predictable) - see the sketch after the next point.
  1. I am still trying to figure out why the eet_dictionary_get() API is public. This is an internal data structure of eet that exists to reduce memory by sharing identical strings (if I understand correctly); by making it public we are forced to put locks on every get/set API, which hurts the base use case (reading eet data).
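
To make the idea concrete, here is a rough sketch of the two locking patterns. The types and names are invented stand-ins, not the real eet internals; only the pthread calls are real:

#include <pthread.h>

/* invented stand-in for the eet dictionary - not the real structure */
typedef struct {
   pthread_rwlock_t lock;
   const char     **strings;  /* shared string table */
} Dict;

/* current pattern: every string lookup during decode locks and unlocks */
static const char *
dict_string_get(Dict *d, int idx)
{
   const char *s;
   pthread_rwlock_rdlock(&d->lock);
   s = d->strings[idx];
   pthread_rwlock_unlock(&d->lock);
   return s;
}

/* proposed pattern: take the lock once around the whole decode and use an
 * unlocked lookup inside the loop - thousands of lookups, one lock cycle */
static void
decode_all(Dict *d, const int *field_idx, int nfields)
{
   pthread_rwlock_rdlock(&d->lock);
   for (int i = 0; i < nfields; i++)
     (void)d->strings[field_idx[i]];   /* unlocked lookup, lock already held */
   pthread_rwlock_unlock(&d->lock);
}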

Please let me know your thoughts on this.

Data: by changing the locks back to spinlocks, the startup time improved by 60ms, and removing the locks altogether improved it by another 20ms.

smohanty created this task. Aug 8 2019, 6:24 PM
smohanty triaged this task as High priority.
raster added a comment. Aug 9 2019, 4:19 AM
  1. i moved to rwlocks because i was profiling... i think it may have been edje_cc writing out and it was really bad, with lots of threads using 100% cpu as they sat on spinlocks while writing stuff out. i think that was the case, from memory.
  2. it needs locks because the rest of eet has locks and eet is used across threads inside of efl... so locks are needed here because of the way we are using eet. we have eet loaders that run in threads and in theory they can access anything in an eet file, so they need these locks. but i think @credic made this api public - not me, so i can't answer as to why it's public, BUT it is used outside of eet itself, for example in edje, so realistically the locks kind of have to stay. we can argue on how to optimize them and keep the functionality... :)

  3. my take on optimizing here is to stand back and redo how we encode data structures. i've been over this before on irc and in person with people. my take is we need to put data structs in a directly mmaped chunk of ram and NEVER decode or decompress. always access directly. this saves decompress/decode time AND saves memory as we never have a decode step. yes - it's a lot more work. a LOT more. but it gets a LOT more gains. for example imagine this:
  1. data is mmaped always into memory.
  2. all data has an index section that maps, let's say, known field strings to field offsets in the file
  3. all data is encoded in blobs where fields are stored in varying sizes and encodings. so we have a number type that is a variable number of bits or even no bits for "zero" or some common constants
  4. the mapping table simply says "field 23 is stored in encoding 3 at mmap offset 872" so any blob (struct) of data is just a stream of these. we can build the dummy-struct-to-mapping-table using offsetof
  5. any fetch of a field is always via a static inline/macro that simply uses the offset of some struct field plus typeof to know where to jump in the mmaped memory to find the source data, decode it very simply (e.g. just read a char, then pad the upper bits with 0 if we want an int but the encoded version uses only 8 bits) and return that. so we need to make struct-like access macros/funcs like:

    DT(read_dest, mapped_data, struct xxx, field);

where DT() would be something like:

/* note: "field" alone has no type and typeof() can't be stringified, so the
   cast uses the struct member and DT_TYPE() stands in for a type-name lookup */
#define DT(dest, data, strct, field) dest = (typeof(((strct *)0)->field)) \
  dt_read(data, DT_TYPE(strct, field), offsetof(strct, field))

and

/* needs <stddef.h>, <stdint.h>, <string.h> and <arpa/inet.h> (for ntohl) */
static inline uintptr_t dt_read(unsigned char *data, const char *type, size_t field) {
  unsigned int *offset_table = (unsigned int *)(data + 0);
  if (!strcmp(type, "int")) {
    return (uintptr_t)(int)ntohl(*((uint32_t *)(data + offset_table[field])));
  } else if (!strcmp(type, "char")) {
    return (uintptr_t)(char)*((char *)(data + offset_table[field]));
  } else if (!strcmp(type, "char *")) {
    return (uintptr_t)(char *)(data + offset_table[field]);
  }
  return 0; /* unknown type - this sketch has no error handling */
}
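
(DT_TYPE() above is hand-waving too - you can't stringify the result of typeof(), so it would have to come from something like a C11 _Generic map; just a sketch covering the three types dt_read() knows about:)

#define DT_TYPE(strct, field)           \
  _Generic(((strct *)0)->field,         \
           int:     "int",              \
           char:    "char",             \
           char *:  "char *",           \
           default: "unknown")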

it's not all right - no verification steps or sanity checks, or even compression of the types or the field values, so it's a bit sparse - but the idea is that a data blob is simply a field index with offsets for where to find the data. a string is simple - offset X is where the string starts. for other values it's just a simple offset where a fixed value (int, char, whatever) sits, and we decode it into native order WHEN we access it. yes, accesses now are a bit more expensive, but there is no decompress/decode and no extra memory used. these data structs would be shared between all processes, so mapped only once in memory (great for themes). we can come up with variable bit/byte encodings, with field offset magic numbers meaning well-known constant values so we have no offset to jump to. imagine offset 0 == explicit value 0, offset 1 == explicit value 1 (very common values).
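
e.g. a totally made-up blob just to show the shape of it (numbers invented; the offset table is stored as native little-endian 32-bit ints here purely for brevity):

static const unsigned char blob[] = {
  /* offset table - one 32-bit slot per field id */
  0,  0, 0, 0,   /* field 0 -> magic offset 0  == explicit constant 0      */
  1,  0, 0, 0,   /* field 1 -> magic offset 1  == explicit constant 1      */
  16, 0, 0, 0,   /* field 2 -> int stored big-endian at blob offset 16     */
  20, 0, 0, 0,   /* field 3 -> string starts at blob offset 20             */
  /* data area */
  0, 0, 2, 148,  /* offset 16: the int 660, big-endian                     */
  'h', 'i', 0    /* offset 20: the string "hi", NUL terminated             */
};

so a read of field 3 is literally just (char *)(blob + offset_table[3]) - no decode step and no allocation.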

my problem with this is finding a way of doing this and having the "read" compile down to almost no code in common cases (so the optimizer will reduce it to a few instructions at worst), AND allowing fields to be added over time without requiring fixed locations, AND allowing fields to be skipped (new fields in the data file that old code does not know about) etc. etc. - i want the field numbers to be like 0, 1, 2, 3, 4, not 0, 4, 8, 16, 17 like they would be with offsetof+alignment requirements. i've been mulling this one over from time to time over the years. waiting for inspiration. :)
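
(e.g. one way to get dense ids - purely a sketch, field names invented - would be generating the struct and a field-id enum from the same x-macro list, so the ids stay 0, 1, 2 regardless of what alignment does to the offsets:)

#define MY_FIELDS(F) \
  F(id,    int)      \
  F(name,  char *)   \
  F(flags, char)

typedef struct {
#define F_DECL(fname, ftype) ftype fname;
  MY_FIELDS(F_DECL)
#undef F_DECL
} My_Data;

typedef enum {
#define F_ID(fname, ftype) MY_FIELD_##fname,
  MY_FIELDS(F_ID)
#undef F_ID
  MY_FIELD_COUNT   /* ids are 0, 1, 2 - dense, unlike offsetof() values */
} My_Field_Id;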

but my main point is -> this is the path to go for optimizations imho :) zero up-front decode cost, zero memory overhead (no local allocs). need to also figure out arrays/lists (sets) of data etc.

> it needs locks because the rest of eet has locks and eet is used across threads inside of efl... so locks are needed here because of the way we are using eet. we have eet loaders that run in threads and in theory they can access anything in an eet file, so they need these locks. but i think @credic made this api public - not me, so i can't answer as to why it's public, BUT it is used outside of eet itself, for example in edje, so realistically the locks kind of have to stay. we can argue on how to optimize them and keep the functionality... :)
I understand why the lock has to be there. My only point is: why can't it be an exclusive lock? I am just asking these questions so that, in case I do a prototype (for Tizen), I make sure to cover all the use cases.
 
We have a few read APIs (eet_data_read() and eet_read()), and we mainly use the read APIs while the program runs. Will there be any issue if I take the lock on the eet dictionary once, read all my data, and then release the lock? (I know that if multiple threads happen to read, one has to wait till the other finishes.) Will it cause some other issue?

I will do the same for the write APIs and make sure to remove any use of eet_dictionary_get() in the code. Will that cover all the cases?

Regarding redesigning the encoding/decoding process: as you said, it's a big job and requires more effort. But currently the main bottleneck is taking too many locks while reading the data structure, so if we can refactor it to use an exclusive lock we will improve the decoding speed by a factor of 2-2.5x.

As writing to EET is a compile-time cost but reading is a runtime cost, we should concentrate on the runtime cost more than the compile-time one.

Note: on targets (mobile, wearable) some collection reads are taking up to 25ms.

I would really appreciate your feedback on taking an exclusive lock in the APIs.

i thought my git commit log explained that - it sped things up when there was contention? can you provide much more detailed benchmarks and tools to show all the cases? like simple usage, more complex usage in single threads, multiple threads (a few of them, many of them) etc.?

if you mean to have unlocked versions of the eet_dictionary api's and then manually take a read lock on the dict when we do an eet_read... of some sort, so it's only taking the lock once per eet_read...?

Herald raised the priority of this task from High to Showstopper Issues. Aug 10 2019, 3:40 AM
raster lowered the priority of this task from Showstopper Issues to Normal. Aug 10 2019, 4:13 AM

nah - just a performance thing and it's not a huge one... it's only in some code paths. so not showstopper

Herald raised the priority of this task from Normal to Showstopper Issues. Aug 10 2019, 4:13 AM

so do you mean something like: P319

(patch against efl)? (locked vs unlocked paths to reduce lock/unlocks during decode to a minimum set).

btw... that should actually be faster than the original spinlock (before the rwlock) as it's even fewer lock+unlocks. as it's a rwlock, multiple readers will not block each other, unlike spinlocks. only a writer will end up blocking others, and that path i have left as-is.
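
(if you want to see that readers-don't-block-readers property in isolation, here is a tiny plain-pthreads demo, nothing eet-specific: both readers get in at the same time, and the writer has to wait for them:)

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t rw = PTHREAD_RWLOCK_INITIALIZER;

static void *reader(void *arg)
{
   pthread_rwlock_rdlock(&rw);
   printf("reader %ld in\n", (long)arg);   /* both readers print "in" together */
   sleep(1);
   printf("reader %ld out\n", (long)arg);
   pthread_rwlock_unlock(&rw);
   return NULL;
}

static void *writer(void *arg)
{
   (void)arg;
   pthread_rwlock_wrlock(&rw);             /* blocks until both readers leave */
   printf("writer in\n");
   pthread_rwlock_unlock(&rw);
   return NULL;
}

int main(void)
{
   pthread_t r1, r2, w;
   pthread_create(&r1, NULL, reader, (void *)1L);
   pthread_create(&r2, NULL, reader, (void *)2L);
   usleep(100000);                         /* let the readers grab the lock first */
   pthread_create(&w, NULL, writer, NULL);
   pthread_join(r1, NULL); pthread_join(r2, NULL); pthread_join(w, NULL);
   return 0;
}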

> so do you mean something like: P319
>
> (patch against efl)? (locked vs unlocked paths to reduce lock/unlocks during decode to a minimum set).

Yes, this is the exact patch I was talking about (take an exclusive lock, do all the reading, and release).

> btw... that should actually be faster than the original spinlock (before the rwlock) as it's even fewer lock+unlocks. as it's a rwlock, multiple readers will not block each other, unlike spinlocks. only a writer will end up blocking others, and that path i have left as-is.

Well, in theory yes, if multiple threads are reading at once (which is rare inside EFL).

Usage:

Create edje object -> theme set -> edje_coll_open -> eet_data_read(); these happen before drawing and usually on the main thread only.
During drawing -> image_load -> eet_read().

As far as I know there are no multiple threads reading at the same time (or it is very rare); the main use case is a single-thread read.

And I suppose an Eina rwlock is a little bit heavier than a spinlock (system call), and that's why the regression.

I think with multiple-thread usage it will definitely be better than spinlocks, as you mentioned.

> i thought my git commit log explained that - it sped things up when there was contention? can you provide much more detailed benchmarks and tools to show all the cases? like simple usage, more complex usage in single threads, multiple threads (a few of them, many of them) etc.?
>
> if you mean to have unlocked versions of the eet_dictionary api's and then manually take a read lock on the dict when we do an eet_read... of some sort, so it's only taking the lock once per eet_read...?

I used the VTune profiler and ran the basic EFL apps. I will share the data with you on Monday (basically the hotspot is on the locks, not on the read).

cool. so i guessed what you wanted. btw that patch has a small bug in it. find it :) (i already have and fixed it locally). it won't affect running things and testing performance though.

well, sticking with a rwlock, which i think is the right thing here, it should do a whole lot fewer lock+unlocks and that should basically clear things up. i don't think a spinlock is the way to go as it's far worse for write contention. i've already done some counting: starting up just elementary_test, the # of rw lock cycles had zero write locks of course, but it went from the current 615604 to 183566 for me. enlightenment (running just my test profile in xephyr) went from 1147892 to 336953. so about 3.4x as many without the above patch. to be honest... this is still a LOT. i'm actually rather surprised at how often we do this... still 180k times for elementary_test to start... i count only 28 calls to eet_data_read_cipher() and none to eet_data_read_cipher_buffer(), eet_data_node_read_cipher() or eet_data_dump_cipher(), so... this is not perfect. :)

yeah. actually found bug #2 in that patch. i still had one of my unlocked funcs taking the lock. forgot to remove it... now down to... 28 locks to start elm test. so... update: locally, with an updated patch... 615604 -> 28 lock+unlock cycles. i think... that will be good enough, eh? :) even if a read lock is 3x as expensive... we do it 0.004% as much as before, and still get the write contention improvements. :) i call this a win.

P320 is an updated patch to try out... i'm using this myself now day to day so i think it's good from a stability point of view to hit git master.

let me know what other stuff you find. :) simple things like this are low-hanging fruit to fix. yeah - doing a new encoding would be even better as we'd save even more work if done well... but... this is a whole lot fewer atomic operations, and atomics are rather expensive. spinlocks need to be atomic too so they are costly as well... just doing them LESS is so much better. it's a tradeoff against contention though, and spinlocks consume 100% cpu during contention. so NOT using them where you get a lot of contention is a good thing. either mutexes or rwlocks are better... but even better is less contention. :)
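
(for reference, a spinlock is basically this underneath - a rough C11-atomics sketch, not eina's actual implementation - which is why a waiter burns 100% cpu instead of sleeping like a mutex/rwlock waiter would:)

#include <stdatomic.h>

typedef struct { atomic_flag locked; } Spinlock;  /* init with ATOMIC_FLAG_INIT */

static void spin_lock(Spinlock *s)
{
   while (atomic_flag_test_and_set_explicit(&s->locked, memory_order_acquire))
     ; /* busy-wait: each failed attempt is another atomic op on a contended cache line */
}

static void spin_unlock(Spinlock *s)
{
   atomic_flag_clear_explicit(&s->locked, memory_order_release);
}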

@raster,
I saw your patch has already landed in master. Awesome :)

I did a quick check before and after. As expected, most of the time is now spent in eet_data_chunck_get().

Before:

After:

Regarding other findings:

Right now I am concentrating on reducing the number of memory allocations and the amount of string parsing during startup, as well as unnecessary memory consumption. I will tag you when I raise a patch or bug for it.
raster closed this task as Resolved. Aug 12 2019, 3:28 AM

cool. i saw those :) let's call this done for now.