Page MenuHomePhabricator

evas_common: use memcpy to copy pixel buffer
ClosedPublic

Authored by ali.alzyod on Jun 22 2019, 1:18 AM.

Details

Summary

This function has no special processing when copy data, so using memcpy can enhance performance

Test Plan
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>

typedef unsigned int       DATA32;

static void
oldFunc(DATA32 *src, DATA32 *dst, size_t len)
{
   DATA32 *dst_end = dst + len;
   while (dst < dst_end)
      *dst++ = *src++;
}

static void
newFunc(DATA32 *src, DATA32 *dst, size_t len)
{
   memcpy(dst, src, len * sizeof(DATA32));
}

int main()
{

   int counter = 1000;
   srand(time(NULL));
   DATA32 src[50000] = {0};
   DATA32 dst[50000] = {0};

   for (int i = 0; i < 50000; i++)
      src[i] = rand();

   clock_t start, end;
   double total_Time1 = 0;
   int i;

   start = clock();
   for (i = 0; i < counter; i++)
      oldFunc(src, dst, 50000);
   end = clock();
   total_Time1 = ((double)(end - start)) / CLOCKS_PER_SEC;
   printf("original = %f \n", total_Time1);

   start = clock();
   for (i = 0; i < counter; i++)
      newFunc(src, dst, 50000);
   end = clock();
   total_Time1 = ((double)(end - start)) / CLOCKS_PER_SEC;
   printf("modified = %f \n", total_Time1);
}

Diff Detail

Repository
rEFL core/efl
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.
ali.alzyod created this revision.Jun 22 2019, 1:18 AM

It seems that this patch has no reviewers specified. If you are unsure who can review your patch, please check this wiki page and see if anyone can be added: https://phab.enlightenment.org/w/maintainers_reviewers/

ali.alzyod requested review of this revision.Jun 22 2019, 1:18 AM
vtorri added a subscriber: vtorri.Jun 22 2019, 3:34 AM

what about making this function inline ?

what about making this function inline ?

I can not tell it may need more investigation, but I think it is guarantee change memory copy technique will speed up the process

Hermet requested changes to this revision.Jun 25 2019, 12:36 AM
Hermet added a subscriber: Hermet.

this c implmentation could be compared with simd implementation.
So, hjust leave it as it does, even memcpy() is not better than current one.

If you see memcpy()

00035 if ((uintptr_t)dst % sizeof(long) == 0 &&
00036 (uintptr_t)src % sizeof(long) == 0 &&
00037 len % sizeof(long) == 0) {
00038
00039 long *d = dst;
00040 const long *s = src;
00041
00042 for (i=0; i<len/sizeof(long); i++) {
00043 d[i] = s[i];
00044 }
00045 }
00046 else {
00047 char *d = dst;
00048 const char *s = src;
00049
00050 for (i=0; i<len; i++) {
00051 d[i] = s[i];
00052 }
00053 }
00054

This revision now requires changes to proceed.Jun 25 2019, 12:36 AM
ali.alzyod edited the test plan for this revision. (Show Details)Jun 25 2019, 12:56 AM
ali.alzyod edited the test plan for this revision. (Show Details)Jun 25 2019, 1:25 AM
ali.alzyod added a comment.EditedJun 25 2019, 1:29 AM

@Hermet Please run test code, you can see will notice the speedup (without optimization flags).

No optimization/ or default optimization:

original = 0.107935
modified = 0.011056

-O3 optimization or -Ofast

No real speed up

So it may be good when we do not use optimization flags for local build

ali.alzyod requested review of this revision.Jun 25 2019, 1:29 AM
ali.alzyod abandoned this revision.Jun 25 2019, 3:10 AM
ali.alzyod changed the visibility from "Public (No Login Required)" to "No One".Jun 25 2019, 3:31 AM
ali.alzyod reclaimed this revision.Jun 25 2019, 3:34 AM
ali.alzyod changed the visibility from "No One" to "Public (No Login Required)".

can you please modify it so that evas_common_copy_pixels_c() is inlined and see if there are perf improvements ?

ali.alzyod added a comment.EditedJun 25 2019, 4:41 AM

@vtorri I modified the example, and did not notice improvement.
Maybe example does not cover all cases (I think number of calls to evas_common_copy_pixels_c, and size of buffer can effect the time consumed)

Hermet accepted this revision.Jun 25 2019, 5:47 AM
This revision is now accepted and ready to land.Jun 25 2019, 5:47 AM
This revision was automatically updated to reflect the committed changes.

@Hermet just for the record, gcc usually has its own internal view of what memcpy is and might directly inline it which allow for more optimization than what gcc would do by relying on libc or our custom version. Arguably to help the compiler better in this file, adding const and restrict around might be a good idea worth trying. Inline might help depending on the targeted hardware. I would still recommend using it.

@ali.alzyod what -march and -mtune flags do you use for checking the assembly? On x86, I think you still need to specify that you want AVX, SSE3 or SSE4 and even FMA4 or FMA3 require to give -mfma and friends.

@cedric I did not check the generated assembly code, I only compare execution time for both cases