Home » Developing U++ » U++ Developers corner » SSE2 and SVO optimization (Painter, memcpy....)
Re: BufferPainter::Clear() optimization [message #53974 is a reply to message #53973]
Mon, 18 May 2020 21:40
Tom1
Messages: 1303 Registered: March 2007
Ultimate Contributor
Hi Mirek,
Something like this, maybe... I'm not quite sure about it, as this method reports a 16M cache for me, although it works quite well here:
static int cachesize=999;
INITBLOCK{
#ifdef COMPILER_MSC
    int cpuInfo[4];
    Zero(cpuInfo);
    __cpuid(cpuInfo, 0x80000006);
#else
    unsigned int cpuInfo[4];
    Zero(cpuInfo);
    __get_cpuid(0x80000006, &cpuInfo[0], &cpuInfo[1], &cpuInfo[2], &cpuInfo[3]);
#endif
    cachesize=1024*(cpuInfo[2]>>16)*(cpuInfo[2]&0xff);
};
void inline Fill3T(void *b, dword data, int len){
    switch(len){
        case 3: ((dword *)b)[2] = data;
        case 2: ((dword *)b)[1] = data;
        case 1: ((dword *)b)[0] = data;
        case 0: return;
    }
    __m128i q = _mm_set1_epi32(*(int*)&data);
    __m128i *w = (__m128i*)b;
    if(len >= 32) {
        __m128i *e = (__m128i*)b + (len>>2) - 8;
        if(len >= (cachesize>>2) && ((uintptr_t)w & 3) == 0) { // for really huge data, bypass the cache
            _mm_storeu_si128(w, q); // Head align
            int s=(-((int)((uintptr_t)b)>>2))&0x3;
            w = (__m128i*) (((dword*)b) + s); // step by s dwords, not by s 16-byte vectors
            do {
                _mm_stream_si128(w++, q);
                _mm_stream_si128(w++, q);
                _mm_stream_si128(w++, q);
                _mm_stream_si128(w++, q);
                _mm_stream_si128(w++, q);
                _mm_stream_si128(w++, q);
                _mm_stream_si128(w++, q);
                _mm_stream_si128(w++, q);
            }while(w<=e);
            _mm_sfence();
        }
        else
            do {
                _mm_storeu_si128(w++, q);
                _mm_storeu_si128(w++, q);
                _mm_storeu_si128(w++, q);
                _mm_storeu_si128(w++, q);
                _mm_storeu_si128(w++, q);
                _mm_storeu_si128(w++, q);
                _mm_storeu_si128(w++, q);
                _mm_storeu_si128(w++, q);
            }while(w<=e);
    }
    if(len & 16) {
        _mm_storeu_si128(w++, q);
        _mm_storeu_si128(w++, q);
        _mm_storeu_si128(w++, q);
        _mm_storeu_si128(w++, q);
    }
    if(len & 8) {
        _mm_storeu_si128(w++, q);
        _mm_storeu_si128(w++, q);
    }
    if(len & 4) {
        _mm_storeu_si128(w, q);
    }
    _mm_storeu_si128((__m128i*) (((dword*)b) + len - 4), q); // Tail align
}
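A note on the 16M figure: as far as I know, ECX of CPUID leaf 0x80000006 describes the L2 cache, with the size in KB in bits 31:16 and the line size in bytes in bits 7:0. If that layout holds, multiplying the two fields overstates the size. A minimal sketch of the decoding under that assumption; `L2Info`, `DecodeL2`, `L2Bytes` and the sample value 0x01006040 are hypothetical and only illustrate the arithmetic:

```cpp
#include <cstdint>

// Hypothetical helper, not part of U++: splits ECX of CPUID leaf 0x80000006.
// Assumed layout: bits 31:16 = L2 size in KB, bits 7:0 = line size in bytes.
struct L2Info {
    uint32_t size_kb;
    uint32_t line_bytes;
};

constexpr L2Info DecodeL2(uint32_t ecx)
{
    return L2Info{ ecx >> 16, ecx & 0xffu };
}

// Size in bytes: only the KB field is scaled; the line size is not a multiplier.
constexpr uint64_t L2Bytes(uint32_t ecx)
{
    return 1024ull * DecodeL2(ecx).size_kb;
}
```

With the illustrative value 0x01006040 (256 KB, 64-byte lines), `L2Bytes` gives 262144, while 1024 * size * line gives 16777216, which would match a 16M reading. Since this leaf describes L2, it probably cannot report the size of a shared last-level cache in any case.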
Best regards,
Tom

Re: BufferPainter::Clear() optimization [message #53975 is a reply to message #53972]
Mon, 18 May 2020 21:56
mirek
Messages: 14257 Registered: November 2005
Ultimate Member
Tom1 wrote on Mon, 18 May 2020 20:57:
Hi,
My CPU here is Intel(R) Core(TM) i7-4790K:
https://ark.intel.com/content/www/us/en/ark/products/80807/intel-core-i7-4790k-processor-8m-cache-up-to-4-40-ghz.html
Not surprisingly, they say it has an 8M 'smart cache'.
Please find attached two CSV files showing the average execution time in ns per call. The length is in dwords. Fill3a is there for reference, and Fill3T uses a 64 dword threshold for streaming in one file and a 2M dword (8MB) threshold in the other. While not shown here, increasing the threshold above 32MB slows a 32 MB buffer fill from 1.5 ms to 3.6 ms.
Best regards,
Tom
If I interpret these numbers correctly, it looks like the potential drop caused by the cache bypass starts to diminish at around 4MB, right?
The thing is, I am afraid that making this dynamic will cause a lot of problems, starting with performance - it is, after all, another read from memory. I would settle for some compromise constant there. Like 4MB...
Mirek
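The compromise constant could be expressed roughly as below. This is only a sketch: `kStreamThresholdBytes` and `UseStreaming` are hypothetical names, and 4MB is simply the value floated above, not a measured optimum.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical names; 4 MB is the compromise value suggested in the thread.
constexpr std::size_t kStreamThresholdBytes = 4 * 1024 * 1024;

// len is a count of 32-bit dwords, as in the Fill functions.
constexpr bool UseStreaming(std::size_t len)
{
    return len > kStreamThresholdBytes / sizeof(std::uint32_t);
}
```

A compile-time constant keeps the check to a single compare against an immediate, with no load of a global such as `cachesize`.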

Re: BufferPainter::Clear() optimization [message #53977 is a reply to message #53751]
Tue, 19 May 2020 00:02
mirek
Messages: 14257 Registered: November 2005
Ultimate Member
What about this:
never_inline
void HugeFill(dword *t, dword c, int len)
{
    __m128i val4 = _mm_set1_epi32(*(int*)&c);
    auto Set4S = [&](int at) { _mm_stream_si128((__m128i *)(t + at), val4); };
    while((uintptr_t)t & 15) { // align to 16 bytes for SSE
        *t++ = c;
        len--;
    }
    while(len >= 16) {
        Set4S(0);
        Set4S(4);
        Set4S(8);
        Set4S(12);
        t += 16;
        len -= 16;
    }
    while(len--)
        *t++ = c;
    _mm_sfence();
}
void Fill6(dword *t, dword c, int len)
{
    if(len >= 4) {
        __m128i val4 = _mm_set1_epi32(*(int*)&c);
        auto Set4 = [&](int at) { _mm_storeu_si128((__m128i *)(t + at), val4); };
        if(len > 4*1024*1024 / 4) {
            HugeFill(t, c, len);
            return;
        }
        while(len >= 16) {
            Set4(0);
            Set4(4);
            Set4(8);
            Set4(12);
            t += 16;
            len -= 16;
        }
        if(len & 8) {
            Set4(0);
            Set4(4);
            t += 8;
        }
        if(len & 4) {
            Set4(0);
            t += 4;
        }
    }
    if(len & 3)
        t[0] = t[(len & 2) >> 1] = t[(len & 2) & ((len & 1) << 1)] = c;
}
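To see what the last line of Fill6 does, here is the same index arithmetic isolated in a helper (`FillTail3` is a hypothetical name, and the `dword` typedef is a stand-in for the U++ type). For len & 3 equal to 1, 2 and 3 the three indices come out as (0, 0, 0), (0, 1, 0) and (0, 1, 2), so only the first len & 3 elements are written, without branching on the count:

```cpp
#include <cassert>
#include <cstdint>

typedef std::uint32_t dword; // stand-in for the U++ dword type

// Writes c to the first (len & 3) elements of t; requires (len & 3) != 0.
inline void FillTail3(dword *t, dword c, int len)
{
    assert((len & 3) != 0);
    int i1 = (len & 2) >> 1;               // 1 when at least 2 elements remain, else 0
    int i2 = (len & 2) & ((len & 1) << 1); // 2 only when exactly 3 remain, else 0
    t[0] = t[i1] = t[i2] = c;
}
```

The redundant stores always land on an element that has to be written anyway, which is what makes the trick safe.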
[Updated on: Tue, 19 May 2020 09:01]

Re: BufferPainter::Clear() optimization [message #53980 is a reply to message #53979]
Tue, 19 May 2020 09:14
mirek
Messages: 14257 Registered: November 2005
Ultimate Member
Yeah, there was another bug in it; I should test more before posting.
In retrospect, while the trick is nice, I do not think it is worth it. But if you want to experiment with this path, I have found a way to extend / simplify it. The basic idea is:
int nlen = -len;
t[1 & HIBYTE(nlen)] = c;
nlen++;
t[2 & HIBYTE(nlen)] = c;
nlen++;
t[3 & HIBYTE(nlen)] = c;
....
(at some point, nlen will become non-negative and thus HIBYTE goes from 0xff to 0x00, thus "grounding" the indices).
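A portable sketch of the same "grounding" idea, for anyone who wants to experiment. It is not the HIBYTE version above: here the mask is derived from a comparison, which is a substitution made for this sketch, and `FillUpTo4` is a hypothetical name. Indices at or beyond len are masked to 0, so the extra stores just rewrite t[0]:

```cpp
#include <cassert>
#include <cstdint>

typedef std::uint32_t dword; // stand-in for the U++ dword type

// Fills t[0..len-1] for 1 <= len <= 4 without an explicit branch on len.
inline void FillUpTo4(dword *t, dword c, int len)
{
    assert(len >= 1 && len <= 4);
    t[0] = c;
    for(int i = 1; i < 4; i++) {
        int mask = -static_cast<int>(i < len); // all ones while i < len, zero afterwards
        t[i & mask] = c;                       // an out-of-range index collapses to 0
    }
}
```

Whether the compiler really emits this without branches would need to be checked in the disassembly.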
Also, I would like to explain why I am trying to beat Fill3T. It is about those switches; while
switch(len) {
case 0:
case 1:
case 2:
default:
}
looks magnificent, it is actually 2 "unstable" branch predictions and quite a bit of code to compute the target address. So
if(len & 2) {
}
if(len & 1) {
}
should be on par - 2 branch predictions and maybe a bit less of code....
Mirek

Re: BufferPainter::Clear() optimization [message #53982 is a reply to message #53751]
Tue, 19 May 2020 11:32
mirek
Messages: 14257 Registered: November 2005
Ultimate Member
Three more variants, based on your Fill3T. Fill7 is basically identical, with a little trick added (hope you like it). Fill7a has a different "frontend". Fill8 is not performing very well; I am adding it just so that you know I have tested that variant too...
Fill7 and Fill7a seem to be basically equal and maybe just a tiny bit faster than Fill3T....
void Fill7(dword *t, dword data, int len){
    switch(len) {
        case 3: t[2] = data;
        case 2: t[1] = data;
        case 1: t[0] = data;
        case 0: return;
    }
    __m128i val4 = _mm_set1_epi32(data);
    auto Set4 = [&](int at) { _mm_storeu_si128((__m128i *)(t + at), val4); };
    Set4(len - 4); // fill tail
    if(len >= 32) {
        if(len >= 1024*1024) { // for really huge data, bypass the cache
            HugeFill(t, data, len);
            return;
        }
        const dword *e = t + len - 32;
        do {
            Set4(0); Set4(4); Set4(8); Set4(12);
            Set4(16); Set4(20); Set4(24); Set4(28);
            t += 32;
        }
        while(t <= e);
    }
    if(len & 16) {
        Set4(0); Set4(4); Set4(8); Set4(12);
        t += 16;
    }
    if(len & 8) {
        Set4(0); Set4(4);
        t += 8;
    }
    if(len & 4)
        Set4(0);
}
void Fill7a(dword *t, dword data, int len){
    if(len < 4) {
        if(len & 2) {
            t[0] = t[1] = data;
            t += 2;
        }
        if(len & 1)
            t[0] = data;
        return;
    }
    __m128i val4 = _mm_set1_epi32(data);
    auto Set4 = [&](int at) { _mm_storeu_si128((__m128i *)(t + at), val4); };
    Set4(len - 4); // fill tail
    if(len >= 32) {
        if(len >= 1024*1024) { // for really huge data, bypass the cache
            HugeFill(t, data, len);
            return;
        }
        const dword *e = t + len - 32;
        do {
            Set4(0); Set4(4); Set4(8); Set4(12);
            Set4(16); Set4(20); Set4(24); Set4(28);
            t += 32;
        }
        while(t <= e);
    }
    if(len & 16) {
        Set4(0); Set4(4); Set4(8); Set4(12);
        t += 16;
    }
    if(len & 8) {
        Set4(0); Set4(4);
        t += 8;
    }
    if(len & 4)
        Set4(0);
}
void Fill8(dword *t, dword data, int len){
    switch(len) {
        case 3: t[2] = data;
        case 2: t[1] = data;
        case 1: t[0] = data;
        case 0: return;
    }
    __m128i val4 = _mm_set1_epi32(data);
    auto Set4 = [&](int at) { _mm_storeu_si128((__m128i *)(t + at), val4); };
    Set4(len - 4); // fill tail
    if(len >= 32) {
        if(len >= 1024*1024) { // for really huge data, bypass the cache
            HugeFill(t, data, len);
            return;
        }
        int cnt = len >> 5;
        do {
            Set4(0); Set4(4); Set4(8); Set4(12);
            len -= 32;
            Set4(16); Set4(20); Set4(24); Set4(28);
            t += 32;
        }
        while(len >= 32);
    }
    switch((len >> 2) & 7) {
        case 7: Set4(24);
        case 6: Set4(20);
        case 5: Set4(16);
        case 4: Set4(12);
        case 3: Set4(8);
        case 2: Set4(4);
        case 1: Set4(0);
    }
}
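The "little trick" in Fill7 and Fill7a appears to be the early `Set4(len - 4)`: one unaligned store that overlaps the end of the buffer, so the len & 3 leftover dwords need no separate handling. A reduced sketch of just that idea, assuming len >= 4; `FillWithTailStore` and `Store4` are hypothetical names, and the scalar fallback is only there so the sketch also compiles where SSE2 is not available:

```cpp
#include <cassert>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define SKETCH_SSE2 1
#endif

typedef std::uint32_t dword; // stand-in for the U++ dword type

// Stores four copies of c at p (p may be unaligned).
inline void Store4(dword *p, dword c)
{
#ifdef SKETCH_SSE2
    _mm_storeu_si128(reinterpret_cast<__m128i *>(p), _mm_set1_epi32(static_cast<int>(c)));
#else
    p[0] = p[1] = p[2] = p[3] = c;
#endif
}

inline void FillWithTailStore(dword *t, dword c, int len)
{
    assert(len >= 4);
    Store4(t + len - 4, c);              // tail first; may overlap the stores below
    for(int i = 0; i + 4 <= len; i += 4)
        Store4(t + i, c);                // whole groups of four from the start
}
```

The overlap costs at most one redundant 16-byte store, in exchange for dropping the scalar tail loop.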

Re: BufferPainter::Clear() optimization [message #53983 is a reply to message #53981]
Tue, 19 May 2020 12:35
Tom1
Messages: 1303 Registered: March 2007
Ultimate Contributor
mirek wrote on Tue, 19 May 2020 10:49:
Also, a little note about your testing code: you loop over the same "len" many times and measure that. The problem is that the first pass sets up branch prediction, so all the other passes are predicted. If "len" is changing, prediction fails and you might get different results....
Which explains why my tests, which feed random lens, show a slightly different picture...
All in all, I think in the end we will just need to test this with Painter....
Mirek
I wish I had thought of this benchmarking pitfall... I mean the branch prediction. Well, I agree that we need to put it in the BufferPainter environment for a real test.
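A minimal sketch of a benchmark loop that avoids this pitfall by drawing a new length for every call. Everything here is an assumption for illustration: `SimpleFill` stands in for the function under test, `BenchRandomLens` is a hypothetical name, and the fixed seed only makes runs repeatable:

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <random>
#include <vector>

typedef std::uint32_t dword; // stand-in for the U++ dword type

// Placeholder for the fill routine being measured.
inline void SimpleFill(dword *t, dword c, int len)
{
    for(int i = 0; i < len; i++)
        t[i] = c;
}

// Calls fill with a different random length each time, so the branch
// predictor cannot lock onto a single len. Returns average ns per call.
template <class Fn>
double BenchRandomLens(Fn fill, int max_len, int calls, unsigned seed = 12345)
{
    assert(max_len > 0 && calls > 0);
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> dist(0, max_len);
    std::vector<int> lens(calls);
    for(int& l : lens)
        l = dist(rng);                   // lengths are generated outside the timed region
    std::vector<dword> buf(max_len);
    auto t0 = std::chrono::steady_clock::now();
    for(int l : lens)
        fill(buf.data(), 0x12345678u, l);
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / calls;
}
```

It would be called as `BenchRandomLens(SimpleFill, 64, 1000000)`. A real harness would also have to stop the compiler from discarding the stores, and the distribution of lengths should resemble what Painter actually produces.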
Meanwhile, as you worked on 7, 7a and 8, I prepared 3T2, which avoids the switch and uses ifs instead. Funnily, your 7a does the same, but with table offsets. 
void inline Fill3T2(dword *b, dword data, int len){
    if(len<4){
        if(len&1) *b++ = data;
        if(len&2){ *b++ = data; *b++ = data; }
        return;
    }
    __m128i q = _mm_set1_epi32(*(int*)&data);
    __m128i *w = (__m128i*)b;
    if(len >= 32) {
        __m128i *e = (__m128i*)b + (len>>2) - 8;
        if(len > 4*1024*1024 / 4 && ((uintptr_t)w & 3) == 0) { // for really huge data, bypass the cache
            _mm_storeu_si128(w, q); // Head align
            int s=(-((int)((uintptr_t)b)>>2))&0x3;
            w = (__m128i*) (b + s);
            do {
                _mm_stream_si128(w++, q);
                _mm_stream_si128(w++, q);
                _mm_stream_si128(w++, q);
                _mm_stream_si128(w++, q);
                _mm_stream_si128(w++, q);
                _mm_stream_si128(w++, q);
                _mm_stream_si128(w++, q);
                _mm_stream_si128(w++, q);
            }while(w<=e);
            _mm_sfence();
        }
        else
            do {
                _mm_storeu_si128(w++, q);
                _mm_storeu_si128(w++, q);
                _mm_storeu_si128(w++, q);
                _mm_storeu_si128(w++, q);
                _mm_storeu_si128(w++, q);
                _mm_storeu_si128(w++, q);
                _mm_storeu_si128(w++, q);
                _mm_storeu_si128(w++, q);
            }while(w<=e);
    }
    if(len & 16) {
        _mm_storeu_si128(w++, q);
        _mm_storeu_si128(w++, q);
        _mm_storeu_si128(w++, q);
        _mm_storeu_si128(w++, q);
    }
    if(len & 8) {
        _mm_storeu_si128(w++, q);
        _mm_storeu_si128(w++, q);
    }
    if(len & 4) {
        _mm_storeu_si128(w, q);
    }
    _mm_storeu_si128((__m128i*) (b + len - 4), q); // Tail align
}
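The head-align step above computes how many dwords to skip to reach the next 16-byte boundary. Here is that expression isolated as a sketch (`HeadDwords` is a hypothetical name; it assumes the pointer is already 4-byte aligned, which the code checks with `& 3`):

```cpp
#include <cstdint>

// Number of dwords between addr and the next 16-byte boundary (0..3).
// Assumes addr is a multiple of 4.
constexpr unsigned HeadDwords(std::uintptr_t addr)
{
    return static_cast<unsigned>((0 - (addr >> 2)) & 3);
}
```

For example, an address ending in 0x4 needs 3 dwords, 0x8 needs 2, 0xC needs 1, and an aligned address needs 0. The unaligned store issued just before the skip covers those dwords.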
I really like the w++ incremental pointer logic over the Set4(pointer+offset). This approach seems to give a small improvement on my system.
Next, I will test your 7 + 7a and report against 3T2.
But seriously, we need to put an end to this madness! The bang for the buck is rapidly decreasing as working hours are increasing...
Best regards,
Tom

Re: BufferPainter::Clear() optimization [message #53985 is a reply to message #53984]
Tue, 19 May 2020 13:18
Tom1
Messages: 1303 Registered: March 2007
Ultimate Contributor
mirek wrote on Tue, 19 May 2020 13:45:
Tom1 wrote on Tue, 19 May 2020 12:35:
I really like the w++ incremental pointer logic over the Set4(pointer+offset). This approach seems to give a small improvement on my system.
Compiler actually converts that to offsets anyway... (I have checked disassembly).
Quote:
But seriously, we need to put an end to this madness! The bang for the buck is rapidly decreasing as working hours are increasing...
Well, you have started it 
Mirek
I admit to it! My fault... 
Anyway, pick your choice: 7a or 3T2, but note that MSBT19 (32-bit, I mean) likes 3T2 better on short transfers. CLANG, CLANGx64 and MSBT19x64 are happy with both. (But please do your own benchmarks, as this is just my repeated scan through different lengths, with the pitfall you pointed out earlier.)
Best regards,
Tom

Re: BufferPainter::Clear() optimization [message #53989 is a reply to message #53986]
Wed, 20 May 2020 01:34
Tom1
Messages: 1303 Registered: March 2007
Ultimate Contributor
Hi Mirek,
Yes, I'm nuts... still working at this hour.
Anyway, here's a new version - Fill3T3 - that can actually handle all alignment variations (even those not handled by 7a). Please benchmark and check for correctness:
never_inline void FillStream(dword *b, dword data, int len){
    while((uintptr_t)b & 15){ // Try to align
        *b++=data;
        len--;
    };
    __m128i *w = (__m128i *)b;
    __m128i q = _mm_set1_epi32((int)data);
    if(len>=16){
        __m128i *e = w + (len>>2) - 3;
        do{
            _mm_stream_si128(w++, q);
            _mm_stream_si128(w++, q);
            _mm_stream_si128(w++, q);
            _mm_stream_si128(w++, q);
        }while(w<e);
    }
    if(len & 8) {
        _mm_stream_si128(w++, q);
        _mm_stream_si128(w++, q);
    }
    if(len & 4) {
        _mm_stream_si128(w++, q);
    }
    _mm_sfence();
    _mm_storeu_si128((__m128i*)(b + len - 4), q); // Tail align
}
void inline Fill3T3(dword *b, dword data, int len){
    if(len<4){
        if(len&1) *b++ = data;
        if(len&2){ *b++ = data; *b++ = data; }
        return;
    }
    __m128i *w = (__m128i *)b;
    __m128i q = _mm_set1_epi32((int)data);
    if(len >= 32) {
        if(len>1024*1024 && (((uintptr_t)b & 3)==0)){
            FillStream(b,data,len);
            return;
        }
        __m128i *e = w + (len>>2) - 7;
        do{
            _mm_storeu_si128(w++, q);
            _mm_storeu_si128(w++, q);
            _mm_storeu_si128(w++, q);
            _mm_storeu_si128(w++, q);
            _mm_storeu_si128(w++, q);
            _mm_storeu_si128(w++, q);
            _mm_storeu_si128(w++, q);
            _mm_storeu_si128(w++, q);
        }while(w<e);
    }
    if(len & 16) {
        _mm_storeu_si128(w++, q);
        _mm_storeu_si128(w++, q);
        _mm_storeu_si128(w++, q);
        _mm_storeu_si128(w++, q);
    }
    if(len & 8) {
        _mm_storeu_si128(w++, q);
        _mm_storeu_si128(w++, q);
    }
    if(len & 4) {
        _mm_storeu_si128(w++, q);
    }
    _mm_storeu_si128((__m128i*)(b + len - 4), q); // Tail align
}
Best regards,
Tom

Re: BufferPainter::Clear() optimization [message #53990 is a reply to message #53989]
Wed, 20 May 2020 01:52
mirek
Messages: 14257 Registered: November 2005
Ultimate Member
Tom1 wrote on Wed, 20 May 2020 01:34:
Hi Mirek,
Yes, I'm nuts... still working at this hour.
Anyway, here's a new version - Fill3T3 - that can actually handle all alignment variations (even those not handled by 7a). Please benchmark and check for correctness:
if(len & 8) {
    _mm_stream_si128(w++, q);
    _mm_stream_si128(w++, q);
}
if(len & 4) {
    _mm_stream_si128(w++, q);
}
Yeah, I think that after filling 8MB of data, this will really have an impact compared to a trivial loop
Mirek

Re: BufferPainter::Clear() optimization [message #53995 is a reply to message #53994]
Wed, 20 May 2020 10:55
Tom1
Messages: 1303 Registered: March 2007
Ultimate Contributor
Hi,
No, sorry... I take that alarm back. There is no error in 3T3 after all. My copy of the code inside Painter was faulty... Now I have taken the correct version and it is all good.
I'm just too tired after not sleeping too much lately...
Best regards,
Tom