|
|
Home » Developing U++ » U++ Developers corner » SSE2 and SVO optimization (Painter, memcpy....)
|
|
|
|
Re: BufferPainter::Clear() optimization [message #54033 is a reply to message #54031] |
Fri, 22 May 2020 11:13   |
Tom1
Messages: 1303 Registered: March 2007
|
Ultimate Contributor |
|
|
Quote:I believe that the problem is that memcpyd became too fat and it screws inlining. So the thing to solve now is to find how to remove some if this fat to non-inline.... (svo_memcpy already has such non-inlined part). Probably same should happend to memsetd too....
Hi Mirek,
I think this could be the same phenomenon that caused me issues with 32-bit MSC. It was more critical to code length and the short transfers suffered immediately when code size increased. At the same time MSBT19x64 and both CLANG and CLANGx64 did not experience any trouble. Perhaps MSBT19 did not do as good job with code size as the rest and on my CPU the instruction cache was exhausted. I bet the instruction cache on your CPU is larger than what my i7 has.
At some moment I was thinking of offering the functions as two variants: inline and never_inline, in a way that the never_inline is simply calling the inline. An then when the code benefits from it, calling the never_inline variant.
Then I also thought of handling something like <= 16 .. 32 sizes inline and the rest in a deeper never_inline function. This would probably improve the situation without adding so much complexity.
Best regards,
Tom
|
|
|
|
Re: BufferPainter::Clear() optimization [message #54035 is a reply to message #54032] |
Fri, 22 May 2020 11:32   |
Tom1
Messages: 1303 Registered: March 2007
|
Ultimate Contributor |
|
|
koldo wrote on Fri, 22 May 2020 11:29One question. To use these new features, is it necessary to set compiler flags, like /arch:AVX in Visual Studio?
Hi Koldo,
Here I do not need /arch:AVX or any other compiler flag added. It's just that include (#include <smmintrin.h> or #include <emmintrin.h>, which works for me, I think.
Best regards,
Tom
EDIT: Mirek was faster to respond!
[Updated on: Fri, 22 May 2020 11:34] Report message to a moderator
|
|
|
|
Re: BufferPainter::Clear() optimization [message #54038 is a reply to message #54036] |
Fri, 22 May 2020 11:46   |
Tom1
Messages: 1303 Registered: March 2007
|
Ultimate Contributor |
|
|
[quote title=mirek wrote on Fri, 22 May 2020 12:39]Tom1 wrote on Fri, 22 May 2020 11:13Quote:
Then I also thought of handling something like <= 16 .. 32 sizes inline and the rest in a deeper never_inline function. This would probably improve the situation without adding so much complexity.
In the trunk now... >=16 now handled by non-inline function. There is impact in your benchmark (the one that runs for all sizes), less impact in my benchmark (with ransom sizes), but I think this is the right move...
Another benefit is that we can now consider using AVX (testing for AVX presence would be clumsy in inline function I think).
Mirek
The apex_memmove() did the architecture checking on startup (or first run) and then initialized function pointers to optimal versions. I think we could do this too in some INITBLOCK.
Best regards,
Tom
|
|
|
Re: BufferPainter::Clear() optimization [message #54039 is a reply to message #54036] |
Fri, 22 May 2020 11:59   |
Tom1
Messages: 1303 Registered: March 2007
|
Ultimate Contributor |
|
|
mirek wrote on Fri, 22 May 2020 12:39
In the trunk now... >=16 now handled by non-inline function. There is impact in your benchmark (the one that runs for all sizes), less impact in my benchmark (with ransom sizes), but I think this is the right move...
It looks like >32 might be better in this case... Not sure though.
BR, Tom
|
|
|
|
Re: BufferPainter::Clear() optimization [message #54042 is a reply to message #54039] |
Fri, 22 May 2020 13:01   |
 |
mirek
Messages: 14257 Registered: November 2005
|
Ultimate Member |
|
|
Tom1 wrote on Fri, 22 May 2020 11:59mirek wrote on Fri, 22 May 2020 12:39
In the trunk now... >=16 now handled by non-inline function. There is impact in your benchmark (the one that runs for all sizes), less impact in my benchmark (with ransom sizes), but I think this is the right move...
It looks like >32 might be better in this case... Not sure though.
BR, Tom
It in turn makes inlined part bigger.... I would rather be careful there.
OK, for what is worth, I have tried with AVX and I do not see any improvement. Here is the code (for CLANG):
__attribute__((target ("avx")))
never_inline
void memsetd_l2(dword *t, dword data, size_t len)
{
__m128i val4 = _mm_set1_epi32(data);
__m256i val8 = _mm256_set1_epi32(data);
auto Set4 = [&](size_t at) { _mm_storeu_si128((__m128i *)(t + at), val4); };
#define Set8(at) _mm256_storeu_si256((__m256i *)(t + at), val8);
Set4(len - 4); // fill tail
if(len >= 32) {
if(len >= 1024*1024) { // for really huge data, bypass the cache
huge_memsetd(t, data, len);
return;
}
Set8(0); // align up on 16 bytes boundary
const dword *e = t + len;
t = (dword *)(((uintptr_t)t | 31) + 1);
len = e - t;
e -= 32;
while(t <= e) {
Set8(0); Set8(8); Set8(16); Set8(24);
t += 32;
}
}
if(len & 16) {
Set8(0); Set8(8);
t += 16;
}
if(len & 8) {
Set8(0);
t += 8;
}
if(len & 4)
Set4(0);
}
inline
void FillX(void *p, dword data, size_t len)
{
dword *t = (dword *)p;
if(len < 4) {
if(len & 2) {
t[0] = t[1] = t[len - 1] = data;
return;
}
if(len & 1)
t[0] = data;
return;
}
if(len >= 16) {
memsetd_l2(t, data, len);
return;
}
__m128i val4 = _mm_set1_epi32(data);
auto Set4 = [&](size_t at) { _mm_storeu_si128((__m128i *)(t + at), val4); };
Set4(len - 4); // fill tail
if(len & 8) {
Set4(0); Set4(4);
t += 8;
}
if(len & 4)
Set4(0);
}
Frankly I am sort of happy, because GCC/CLANG way of dealing with AVX is really stupid: It declines AVX instrinics, unless you compile whole function for AVX code, but then it starts generating AVX opcodes everywhere and the funciton does not run on non-AVX CPUs anymore.
|
|
|
|
|
Re: BufferPainter::Clear() optimization [message #54046 is a reply to message #54045] |
Fri, 22 May 2020 19:03   |
 |
mirek
Messages: 14257 Registered: November 2005
|
Ultimate Member |
|
|
Added memcpy optimized for sizeof 8 and 16 and this little neat function to make sense from it all:
template <class T>
void memcpy_t(T *t, const T *s, size_t count)
{
if((sizeof(T) & 15) == 0)
memcpydq((dqword *)t, (const dqword *)s, count * (sizeof(T) >> 4));
else
if((sizeof(T) & 7) == 0)
memcpyq((qword *)t, (const qword *)s, count * (sizeof(T) >> 3));
else
if((sizeof(T) & 3) == 0)
memcpyd((dword *)t, (const dword *)s, count * (sizeof(T) >> 2));
else
svo_memcpy((void *)t, (void *)s, count * sizeof(T));
}
Vector<String>::ReAlloc(int newalloc)
disassembly now looks magnificent, copying elements to new buffer with SSE2...
|
|
|
|
Re: BufferPainter::Clear() optimization [message #54050 is a reply to message #54049] |
Sun, 24 May 2020 11:56   |
Oblivion
Messages: 1206 Registered: August 2007
|
Senior Contributor |
|
|
Hello Mirek,
On Linux 5.4 and 5.6, with CLANG 10.0
TIMING SSE : 119.41 ms - 119.41 ns ( 1.06 s / 1000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 0 - 1000000
TIMING Non SSE : 232.41 ms - 232.41 ns ( 1.18 s / 1000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 0 - 1000000
On GCC (10.1): apparently _mm_storeu_si32 is yet to be implemented. : 
'_mm_storeu_si32' was not declared in this scope; did you mean '_mm_storeu_epi32'?
(): 47 | _mm_storeu_si32(rgba, PackRGBA(x, _mm_setzero_si128()));
(): | ^~~~~~~~~~~~~~~
(): | _mm_storeu_epi32
Possible workaround is given here:
https://stackoverflow.com/questions/58063933/how-can-a-sse2- function-be-missing-from-the-header-it-is-supposed-to-be-in
Best regards,
Oblivion
Github page: https://github.com/ismail-yilmaz
upp-components: https://github.com/ismail-yilmaz/upp-components
Bobcat the terminal emulator: https://github.com/ismail-yilmaz/Bobcat
|
|
|
Re: BufferPainter::Clear() optimization [message #54056 is a reply to message #54049] |
Tue, 26 May 2020 13:14   |
Tom1
Messages: 1303 Registered: March 2007
|
Ultimate Contributor |
|
|
Hi!
Sorry for the delay... I was out of town for a while.
Here are my results for Windows 10 pro x64 on Core i7:
MSBT19x64:
TIMING SSE : 37.08 ms - 37.08 ns (50.00 ms / 1000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 0 - 1000000
TIMING Non SSE : 129.08 ms - 129.08 ns (142.00 ms / 1000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 0 - 1000000
MSBT19:
TIMING SSE : 29.88 ms - 29.88 ns (45.00 ms / 1000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 0 - 1000000
TIMING Non SSE : 125.88 ms - 125.88 ns (141.00 ms / 1000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 0 - 1000000
CLANG:
TIMING SSE : 37.41 ms - 37.41 ns (50.00 ms / 1000000 ), min: 0.00 ns, max: 2.00 ms, nesting: 0 - 1000000
TIMING Non SSE : 125.41 ms - 125.41 ns (138.00 ms / 1000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 0 - 1000000
CLANGx64:
TIMING SSE : 37.43 ms - 37.43 ns (47.00 ms / 1000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 0 - 1000000
TIMING Non SSE : 129.43 ms - 129.43 ns (139.00 ms / 1000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 0 - 1000000
Impressive numbers Mirek! When is this going to be available on BufferPainter?
Best regards,
Tom
|
|
|
Re: BufferPainter::Clear() optimization [message #54057 is a reply to message #54056] |
Tue, 26 May 2020 14:15   |
 |
mirek
Messages: 14257 Registered: November 2005
|
Ultimate Member |
|
|
Tom1 wrote on Tue, 26 May 2020 13:14Hi!
Sorry for the delay... I was out of town for a while.
Here are my results for Windows 10 pro x64 on Core i7:
MSBT19x64:
TIMING SSE : 37.08 ms - 37.08 ns (50.00 ms / 1000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 0 - 1000000
TIMING Non SSE : 129.08 ms - 129.08 ns (142.00 ms / 1000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 0 - 1000000
MSBT19:
TIMING SSE : 29.88 ms - 29.88 ns (45.00 ms / 1000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 0 - 1000000
TIMING Non SSE : 125.88 ms - 125.88 ns (141.00 ms / 1000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 0 - 1000000
CLANG:
TIMING SSE : 37.41 ms - 37.41 ns (50.00 ms / 1000000 ), min: 0.00 ns, max: 2.00 ms, nesting: 0 - 1000000
TIMING Non SSE : 125.41 ms - 125.41 ns (138.00 ms / 1000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 0 - 1000000
CLANGx64:
TIMING SSE : 37.43 ms - 37.43 ns (47.00 ms / 1000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 0 - 1000000
TIMING Non SSE : 129.43 ms - 129.43 ns (139.00 ms / 1000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 0 - 1000000
Impressive numbers Mirek! When is this going to be available on BufferPainter?
Best regards,
Tom
I guess by the end of the week. Still fixing bugs + there is like 8 variants to implement...
|
|
|
Re: BufferPainter::Clear() optimization [message #54099 is a reply to message #54057] |
Mon, 01 June 2020 00:39   |
 |
mirek
Messages: 14257 Registered: November 2005
|
Ultimate Member |
|
|
While optimizing memcpy and memset, I have tried a new look at othre things, like String comparison and memhash. I think I improved String::operator== a tiny bit and now I am working on memhash function. Decided to introduce "hash_t" and to have hash value 64 bit when CPU_64.
After bit of experimenting, I have found these functions (one for 64 bit, other 32 bit) work best:
never_inline
uint64 memhash64(const void *ptr, int len)
{
const byte *s = (byte *)ptr;
uint64 val = HASH64_CONST1;
if(len >= 8) {
if(len >= 32) {
uint64 val1, val2, val3, val4;
val1 = val2 = val3 = val4 = HASH64_CONST1;
while(len >= 32) {
val1 = HASH64_CONST2 * val1 + *(qword *)(s);
val2 = HASH64_CONST2 * val2 + *(qword *)(s + 8);
val3 = HASH64_CONST2 * val3 + *(qword *)(s + 16);
val4 = HASH64_CONST2 * val4 + *(qword *)(s + 24);
s += 32;
len -= 32;
}
val = HASH64_CONST2 * val + val1;
val = HASH64_CONST2 * val + val2;
val = HASH64_CONST2 * val + val3;
val = HASH64_CONST2 * val + val4;
}
const byte *e = s + len - 8;
while(s < e) {
val = HASH64_CONST2 * val + *(qword *)(s);
s += 8;
}
return HASH64_CONST2 * val + *(qword *)(e);
}
if(len > 4) {
val = HASH64_CONST2 * val + *(dword *)(s);
val = HASH64_CONST2 * val + *(dword *)(s + len - 4);
return val;
}
if(len >= 2) {
val = HASH64_CONST2 * val + *(word *)(s);
val = HASH64_CONST2 * val + *(word *)(s + len - 2);
return val;
}
return len ? HASH64_CONST2 * val + *s : val;
}
never_inline
uint64 memhash32(const void *ptr, int len)
{
const byte *s = (byte *)ptr;
uint64 val = HASH32_CONST1;
if(len >= 4) {
if(len >= 16) {
uint64 val1, val2, val3, val4;
val1 = val2 = val3 = val4 = HASH32_CONST1;
while(len >= 32) {
val1 = HASH32_CONST2 * val1 + *(dword *)(s);
val2 = HASH32_CONST2 * val2 + *(dword *)(s + 4);
val3 = HASH32_CONST2 * val3 + *(dword *)(s + 8);
val4 = HASH32_CONST2 * val4 + *(dword *)(s + 12);
s += 16;
len -= 16;
}
val = HASH32_CONST2 * val + val1;
val = HASH32_CONST2 * val + val2;
val = HASH32_CONST2 * val + val3;
val = HASH32_CONST2 * val + val4;
}
const byte *e = s + len - 4;
while(s < e) {
val = HASH32_CONST2 * val + *(dword *)(s);
s += 4;
}
return HASH32_CONST2 * val + *(dword *)(e);
}
if(len >= 2) {
val = HASH32_CONST2 * val + *(word *)(s);
val = HASH32_CONST2 * val + *(word *)(s + len - 2);
return val;
}
return len ? HASH32_CONST2 * val + *s : val;
}
While other "mem*" functions are easy to write tests for, hasing is a bit more complicated; can I request some code review here? Basically, I think combination functions are OK, but I would like to be sure it reads exactly len bytes from memory (it is ok if some are read twice...).
Mirek
|
|
|
Goto Forum:
Current Time: Sun May 11 22:22:41 CEST 2025
Total time taken to generate the page: 0.00786 seconds
|
|
|