Home » Developing U++ » U++ Developers corner » SSE2 and SVO optimization (Painter, memcpy....)
|
|
|
|
| Re: BufferPainter::Clear() optimization [message #53974 is a reply to message #53973] |
Mon, 18 May 2020 21:40   |
Tom1
Messages: 1319 Registered: March 2007
|
Ultimate Contributor |
|
|
Hi Mirek,
Something like this, maybe... I'm not quite sure as this method reports 16M cache for me -- although this works quite well for me:
static int cachesize=999;
INITBLOCK{
#ifdef COMPILER_MSC
int cpuInfo[4];
Zero(cpuInfo);
__cpuid(cpuInfo, 0x80000006);
#else
unsigned int cpuInfo[4];
Zero(cpuInfo);
__get_cpuid(0x80000006, &cpuInfo[0], &cpuInfo[1], &cpuInfo[2], &cpuInfo[3]);
#endif
cachesize=1024*(cpuInfo[2]>>16)*(cpuInfo[2]&0xff);
};
void inline Fill3T(void *b, dword data, int len){
switch(len){
case 3: ((dword *)b)[2] = data;
case 2: ((dword *)b)[1] = data;
case 1: ((dword *)b)[0] = data;
case 0: return;
}
__m128i q = _mm_set1_epi32(*(int*)&data);
__m128i *w = (__m128i*)b;
if(len >= 32) {
__m128i *e = (__m128i*)b + (len>>2) - 8;
if(len >= (cachesize>>2) && ((uintptr_t)w & 3) == 0) { // for really huge data, bypass the cache
_mm_storeu_si128(w, q); // Head align
int s=(-((int)((uintptr_t)b)>>2))&0x3;
w = (__m128i*) ((dword*)b) + s;
do {
_mm_stream_si128(w++, q);
_mm_stream_si128(w++, q);
_mm_stream_si128(w++, q);
_mm_stream_si128(w++, q);
_mm_stream_si128(w++, q);
_mm_stream_si128(w++, q);
_mm_stream_si128(w++, q);
_mm_stream_si128(w++, q);
}while(w<=e);
_mm_sfence();
}
else
do {
_mm_storeu_si128(w++, q);
_mm_storeu_si128(w++, q);
_mm_storeu_si128(w++, q);
_mm_storeu_si128(w++, q);
_mm_storeu_si128(w++, q);
_mm_storeu_si128(w++, q);
_mm_storeu_si128(w++, q);
_mm_storeu_si128(w++, q);
}while(w<=e);
}
if(len & 16) {
_mm_storeu_si128(w++, q);
_mm_storeu_si128(w++, q);
_mm_storeu_si128(w++, q);
_mm_storeu_si128(w++, q);
}
if(len & 8) {
_mm_storeu_si128(w++, q);
_mm_storeu_si128(w++, q);
}
if(len & 4) {
_mm_storeu_si128(w, q);
}
_mm_storeu_si128((__m128i*) (((dword*)b) + len - 4), q); // Tail align
}
Best regards,
Tom
|
|
|
|
| Re: BufferPainter::Clear() optimization [message #53975 is a reply to message #53972] |
Mon, 18 May 2020 21:56   |
 |
mirek
Messages: 14291 Registered: November 2005
|
Ultimate Member |
|
|
Tom1 wrote on Mon, 18 May 2020 20:57Hi,
My CPU here is Intel(R) Core(TM) i7-4790K:
https://ark.intel.com/content/www/us/en/ark/products/80807/i ntel-core-i7-4790k-processor-8m-cache-up-to-4-40-ghz.html
Not surprisingly, they say it has an 8M 'smart cache'.
Please find attached two CSV files portraying execution time in ns for each call in average. The length is in dwords. Fill3a is there for reference and Fill3T is using 64 dword threshold for streaming in one and 2M dword (8MB) threshold in the other file. While not portrayed here, increasing the threshold above 32MB decreases the performance from 1.5 ms to 3.6 ms for a 32 MB buffer.
Best regards,
Tom
If I interpret these numbers correctly, it looks like around 4MB potential drop because of cache bypass starts to be diminish, right?
Thing is, I am afraid that making this dynamic will cause a lot of problems, starting with perfromance - it is after all another read from the memory. I would settle for some compromise constant there. Like 4MB... 
Mirek
|
|
|
|
| Re: BufferPainter::Clear() optimization [message #53977 is a reply to message #53751] |
Tue, 19 May 2020 00:02   |
 |
mirek
Messages: 14291 Registered: November 2005
|
Ultimate Member |
|
|
What about this:
never_inline
void HugeFill(dword *t, dword c, int len)
{
__m128i val4 = _mm_set1_epi32(*(int*)&c);
auto Set4S = [&](int at) { _mm_stream_si128((__m128i *)(t + at), val4); };
while((uintptr_t)t & 15) { // align to 16 bytes for SSE
*t++ = c;
len--;
}
while(len >= 16) {
Set4S(0);
Set4S(4);
Set4S(8);
Set4S(12);
t += 16;
len -= 16;
}
while(len--)
*t++ = c;
_mm_sfence();
}
void Fill6(dword *t, dword c, int len)
{
if(len >= 4) {
__m128i val4 = _mm_set1_epi32(*(int*)&c);
auto Set4 = [&](int at) { _mm_storeu_si128((__m128i *)(t + at), val4); };
if(len > 4*1024*1024 / 4) {
HugeFill(t, c, len);
return;
}
while(len >= 16) {
Set4(0);
Set4(4);
Set4(8);
Set4(12);
t += 16;
len -= 16;
}
if(len & 8) {
Set4(0);
Set4(4);
t += 8;
}
if(len & 4) {
Set4(0);
t += 4;
}
}
if(len & 3)
t[0] = t[(len & 2) >> 1] = t[(len & 2) & ((len & 1) << 1)] = c;
}
[Updated on: Tue, 19 May 2020 09:01] Report message to a moderator
|
|
|
|
|
|
| Re: BufferPainter::Clear() optimization [message #53980 is a reply to message #53979] |
Tue, 19 May 2020 09:14   |
 |
mirek
Messages: 14291 Registered: November 2005
|
Ultimate Member |
|
|
Yeah, there was another bug in it, I should test more before posting.
In retrospective, while the trick is nice, I do not think it is worth it. But if you wanted to experiment with this path, I have found the way how to extend / simplify this. The basic idea is
int nlen = -len;
t[1 & HIBYTE(nlen)] = c;
nlen++;
t[2 & HIBYTE(nlen)] = c;
nlen++;
t[3 & HIBYTE(nlen)] = c;
....
(at some point, nlen will become > 0 and thus HIBYTE goes from 0xff to 0x00, thus "grounding" indices).
Also, I would like to try to explain why I am trying to beat Fill3T. It is about those switches, while
switch(len) {
case 0:
case 1:
case 2:
default:
}
looks magnificent, it is actually 2 "unstable" branch predictions and quite a bit of code to compute the target address. So
if(len & 2) {
}
if(len & 1) {
}
should be on par - 2 branch predictions and maybe a bit less of code....
Mirek
|
|
|
|
|
|
| Re: BufferPainter::Clear() optimization [message #53982 is a reply to message #53751] |
Tue, 19 May 2020 11:32   |
 |
mirek
Messages: 14291 Registered: November 2005
|
Ultimate Member |
|
|
Three more variants, based on your FillT. Fill7 is basically identical, with a little trick added (hope you like it). Fill7a has different "frontend". Fill8 is not performing very well, adding that just so that you know I have tested that variant too... 
Fill7 and Fill7a seem to be basically equal and maybe just a tiny bit faster than Fill3T....
void Fill7(dword *t, dword data, int len){
switch(len) {
case 3: t[2] = data;
case 2: t[1] = data;
case 1: t[0] = data;
case 0: return;
}
__m128i val4 = _mm_set1_epi32(data);
auto Set4 = [&](int at) { _mm_storeu_si128((__m128i *)(t + at), val4); };
Set4(len - 4); // fill tail
if(len >= 32) {
if(len >= 1024*1024) { // for really huge data, bypass the cache
HugeFill(t, data, len);
return;
}
const dword *e = t + len - 32;
do {
Set4(0); Set4(4); Set4(8); Set4(12);
Set4(16); Set4(20); Set4(24); Set4(28);
t += 32;
}
while(t <= e);
}
if(len & 16) {
Set4(0); Set4(4); Set4(8); Set4(12);
t += 16;
}
if(len & 8) {
Set4(0); Set4(4);
t += 8;
}
if(len & 4)
Set4(0);
}
void Fill7a(dword *t, dword data, int len){
if(len < 4) {
if(len & 2) {
t[0] = t[1] = data;
t += 2;
}
if(len & 1)
t[0] = data;
return;
}
__m128i val4 = _mm_set1_epi32(data);
auto Set4 = [&](int at) { _mm_storeu_si128((__m128i *)(t + at), val4); };
Set4(len - 4); // fill tail
if(len >= 32) {
if(len >= 1024*1024) { // for really huge data, bypass the cache
HugeFill(t, data, len);
return;
}
const dword *e = t + len - 32;
do {
Set4(0); Set4(4); Set4(8); Set4(12);
Set4(16); Set4(20); Set4(24); Set4(28);
t += 32;
}
while(t <= e);
}
if(len & 16) {
Set4(0); Set4(4); Set4(8); Set4(12);
t += 16;
}
if(len & 8) {
Set4(0); Set4(4);
t += 8;
}
if(len & 4)
Set4(0);
}
void Fill8(dword *t, dword data, int len){
switch(len) {
case 3: t[2] = data;
case 2: t[1] = data;
case 1: t[0] = data;
case 0: return;
}
__m128i val4 = _mm_set1_epi32(data);
auto Set4 = [&](int at) { _mm_storeu_si128((__m128i *)(t + at), val4); };
Set4(len - 4); // fill tail
if(len >= 32) {
if(len >= 1024*1024) { // for really huge data, bypass the cache
HugeFill(t, data, len);
return;
}
int cnt = len >> 5;
do {
Set4(0); Set4(4); Set4(8); Set4(12);
len -= 32;
Set4(16); Set4(20); Set4(24); Set4(28);
t += 32;
}
while(len >= 32);
}
switch((len >> 2) & 7) {
case 7: Set4(24);
case 6: Set4(20);
case 5: Set4(16);
case 4: Set4(12);
case 3: Set4(8);
case 2: Set4(4);
case 1: Set4(0);
}
}
|
|
|
|
| Re: BufferPainter::Clear() optimization [message #53983 is a reply to message #53981] |
Tue, 19 May 2020 12:35   |
Tom1
Messages: 1319 Registered: March 2007
|
Ultimate Contributor |
|
|
mirek wrote on Tue, 19 May 2020 10:49Also, a little note about your testing code: You loop over the same "len" many times and measure that. The problem is that first pass setups branch prediction so all other passes are predicted. If "len" is changing, prediction fails and you might get different results....
Which explains why my tests, which feeds random lens, shows a bit different picture... 
All in all, I think in the end we will just need to test this with Painter....
Mirek
I wish I came to think of this benchmarking pitfall... I mean the branch prediction. Well, I agree that we need to put it in the BufferPainter environment for real test.
Meanwhile, as you worked on 7, 7a and 8, I prepared 3T2, which avoids the switch and uses ifs instead. Funnily, your 7a does the same, but with table offsets. 
void inline Fill3T2(dword *b, dword data, int len){
if(len<4){
if(len&1) *b++ = data;
if(len&2){ *b++ = data; *b++ = data; }
return;
}
__m128i q = _mm_set1_epi32(*(int*)&data);
__m128i *w = (__m128i*)b;
if(len >= 32) {
__m128i *e = (__m128i*)b + (len>>2) - 8;
if(len > 4*1024*1024 / 4 && ((uintptr_t)w & 3) == 0) { // for really huge data, bypass the cache
_mm_storeu_si128(w, q); // Head align
int s=(-((int)((uintptr_t)b)>>2))&0x3;
w = (__m128i*) (b + s);
do {
_mm_stream_si128(w++, q);
_mm_stream_si128(w++, q);
_mm_stream_si128(w++, q);
_mm_stream_si128(w++, q);
_mm_stream_si128(w++, q);
_mm_stream_si128(w++, q);
_mm_stream_si128(w++, q);
_mm_stream_si128(w++, q);
}while(w<=e);
_mm_sfence();
}
else
do {
_mm_storeu_si128(w++, q);
_mm_storeu_si128(w++, q);
_mm_storeu_si128(w++, q);
_mm_storeu_si128(w++, q);
_mm_storeu_si128(w++, q);
_mm_storeu_si128(w++, q);
_mm_storeu_si128(w++, q);
_mm_storeu_si128(w++, q);
}while(w<=e);
}
if(len & 16) {
_mm_storeu_si128(w++, q);
_mm_storeu_si128(w++, q);
_mm_storeu_si128(w++, q);
_mm_storeu_si128(w++, q);
}
if(len & 8) {
_mm_storeu_si128(w++, q);
_mm_storeu_si128(w++, q);
}
if(len & 4) {
_mm_storeu_si128(w, q);
}
_mm_storeu_si128((__m128i*) (b + len - 4), q); // Tail align
}
I really like the w++ incremental pointer logic over the Set4(pointer+offset). This approach seems to give a small improvement on my system.
Next, I will test your 7 + 7a and report against 3T2.
But seriously, we need to put an end to this madness! The bang for the buck is rapidly decreasing as working hours are increasing...
Best regards,
Tom
|
|
|
|
|
|
| Re: BufferPainter::Clear() optimization [message #53985 is a reply to message #53984] |
Tue, 19 May 2020 13:18   |
Tom1
Messages: 1319 Registered: March 2007
|
Ultimate Contributor |
|
|
[quote title=mirek wrote on Tue, 19 May 2020 13:45]Tom1 wrote on Tue, 19 May 2020 12:35mirek wrote on Tue, 19 May 2020 10:49
I really like the w++ incremental pointer logic over the Set4(pointer+offset). This approach seems to give a small improvement on my system.
Compiler actually converts that to offsets anyway... (I have checked disassembly).
Quote:
But seriously, we need to put an end to this madness! The bang for the buck is rapidly decreasing as working hours are increasing...
Well, you have started it 
Mirek
I admit to it! My fault... 
Anyway, pick you choice: 7a or 3T2, but note that MSBT19 (32bit I mean) likes 3T2 better on short transfers. CLANG, CLANGx64 and MSBT19x64 are happy with both. (But, please do your own benchmarks, as this is just my repeated scan through different lengths with the pitfall you pointed out earlier.)
Best regards,
Tom
|
|
|
|
|
|
| Re: BufferPainter::Clear() optimization [message #53989 is a reply to message #53986] |
Wed, 20 May 2020 01:34   |
Tom1
Messages: 1319 Registered: March 2007
|
Ultimate Contributor |
|
|
Hi Mirek,
Yes, I'm nuts... still working at this hour.
Anyway, here's a new version - Fill3T3 - that can actually handle all alignment variations (even those not handled by 7a). Please benchmark and check for correctness:
never_inline void FillStream(dword *b, dword data, int len){
while((uintptr_t)b & 15){ // Try to align
*b++=data;
len--;
};
__m128i *w = (__m128i *)b;
__m128i q = _mm_set1_epi32((int)data);
if(len>=16){
__m128i *e = w + (len>>2) - 3;
do{
_mm_stream_si128(w++, q);
_mm_stream_si128(w++, q);
_mm_stream_si128(w++, q);
_mm_stream_si128(w++, q);
}while(w<e);
}
if(len & 8) {
_mm_stream_si128(w++, q);
_mm_stream_si128(w++, q);
}
if(len & 4) {
_mm_stream_si128(w++, q);
}
_mm_sfence();
_mm_storeu_si128((__m128i*)(b + len - 4), q); // Tail align
}
void inline Fill3T3(dword *b, dword data, int len){
if(len<4){
if(len&1) *b++ = data;
if(len&2){ *b++ = data; *b++ = data; }
return;
}
__m128i *w = (__m128i *)b;
__m128i q = _mm_set1_epi32((int)data);
if(len >= 32) {
if(len>1024*1024 && (((uintptr_t)b & 3)==0)){
FillStream(b,data,len);
return;
}
__m128i *e = w + (len>>2) - 7;
do{
_mm_storeu_si128(w++, q);
_mm_storeu_si128(w++, q);
_mm_storeu_si128(w++, q);
_mm_storeu_si128(w++, q);
_mm_storeu_si128(w++, q);
_mm_storeu_si128(w++, q);
_mm_storeu_si128(w++, q);
_mm_storeu_si128(w++, q);
}while(w<e);
}
if(len & 16) {
_mm_storeu_si128(w++, q);
_mm_storeu_si128(w++, q);
_mm_storeu_si128(w++, q);
_mm_storeu_si128(w++, q);
}
if(len & 8) {
_mm_storeu_si128(w++, q);
_mm_storeu_si128(w++, q);
}
if(len & 4) {
_mm_storeu_si128(w++, q);
}
_mm_storeu_si128((__m128i*)(b + len - 4), q); // Tail align
}
Best regards,
Tom
|
|
|
|
| Re: BufferPainter::Clear() optimization [message #53990 is a reply to message #53989] |
Wed, 20 May 2020 01:52   |
 |
mirek
Messages: 14291 Registered: November 2005
|
Ultimate Member |
|
|
[quote title=Tom1 wrote on Wed, 20 May 2020 01:34]Hi Mirek,
Yes, I'm nuts... still working at this hour.
Anyway, here's a new version - Fill3T3 - that can actually handle all alignment variations (even those not handled by 7a). Please benchmark and check for correctness:
if(len & 8) {
_mm_stream_si128(w++, q);
_mm_stream_si128(w++, q);
}
if(len & 4) {
_mm_stream_si128(w++, q);
}
Yeah, I think that after filling 8MB of data, this will really have impact compared to trivial loop 
Mirek
|
|
|
|
|
|
|
|
|
|
| Re: BufferPainter::Clear() optimization [message #53995 is a reply to message #53994] |
Wed, 20 May 2020 10:55   |
Tom1
Messages: 1319 Registered: March 2007
|
Ultimate Contributor |
|
|
Hi,
No, Sorry... I'll take that alarm back. There is no error in 3T3 after all. My copy of the code inside Painter was faulty... Now I took the correct version and it is all good now.
I'm just too tired after not sleeping too much lately...
Best regards,
Tom
|
|
|
|
|
|
|
|
| Re: BufferPainter::Clear() optimization [message #53998 is a reply to message #53997] |
Wed, 20 May 2020 12:41   |
Tom1
Messages: 1319 Registered: March 2007
|
Ultimate Contributor |
|
|
Hi Mirek,
Two things to consider before you go with 7a:
- 7a crashes on unaligned buffers (t&3) while 3T3 handles them all.
- 3T3 is faster on MSBT19 with short transfers up to 50-60 dwords.
Best regards,
Tom
|
|
|
|
| Re: BufferPainter::Clear() optimization [message #53999 is a reply to message #53997] |
Wed, 20 May 2020 12:52   |
Tom1
Messages: 1319 Registered: March 2007
|
Ultimate Contributor |
|
|
mirek wrote on Wed, 20 May 2020 13:23OK, after retesting, I think it might be at most 3% faster. Looking at fillers, I think there is much more time spent in AlphaBlend function - even if it is just for segment start/end pixels. Perhaps that one should be SSE2 optimized? 
Mirek
Hi,
My SSE2 battery is now 'discharged' for a while.... Need to recharge before next use. 
I also did some testing on span filler with memcpy. This is based on using IMAGE_OPAQUE of the image being rendered. It does improve the speed somewhat, but the edges cause a problem since the edge is alpha blended even if FILL_FAST is specified. So, this needs some reconsideration and better knowledge on the Painter internals (i.e. beyond my level...):
BufferPainter.h:
struct SpanSource {
int kind;
SpanSource(){
kind = IMAGE_OPAQUE;
}
virtual void Get(RGBA *span, int x, int y, unsigned len) = 0;
virtual ~SpanSource() {}
};
Fillers.cpp:
void SpanFiller::Render(int val, int len)
{
if(val == 0) {
t += len;
s += len;
return;
}
if(alpha != 256)
val = alpha * val >> 8;
if(val == 256) {
if(ss->kind==IMAGE_OPAQUE) memcpy(t,s,len*sizeof(RGBA)); // apex_memcpy() would be even faster
else{
for(int i = 0; i < len; i++) {
if(s[i].a == 255)
t[i] = s[i];
else
AlphaBlend(t[i], s[i]);
}
}
t += len;
s += len;
}
else {
const RGBA *e = t + len;
while(t < e)
AlphaBlendCover8(*t++, *s++, val);
}
}
Painter/Image.cpp:
struct PainterImageSpan : SpanSource, PainterImageSpanData {
LinearInterpolator interpolator;
PainterImageSpan(const PainterImageSpanData& f)
: PainterImageSpanData(f) {
interpolator.Set(xform);
kind = image.GetKindNoScan(); // Add this
}
Best regards,
Tom
|
|
|
|
|
|
| Re: BufferPainter::Clear() optimization [message #54002 is a reply to message #54000] |
Wed, 20 May 2020 13:01   |
Tom1
Messages: 1319 Registered: March 2007
|
Ultimate Contributor |
|
|
Quote:I was aware about unaligned problem, thats fixed in final version. That said, unaligned in general should be considered illegal anyway, because otherwise hell will broke lose with Armv6....
But that's good to know. In this case we could drop (t&3) code entirely from 3T3 and improve instruction cache locality for even better results on short transfers.
((Is there a way to 'cleanly crash' (whatever that might mean) an application attempting unaligned memset? Now it just disappears from the process list at least on Windows.))
EDIT: Let me rephrase it: Is there a way to check during development that an application will never use unaligned memset?
Best regards,
Tom
[Updated on: Wed, 20 May 2020 13:09] Report message to a moderator
|
|
|
|
|
|
| Re: BufferPainter::Clear() optimization [message #54004 is a reply to message #53998] |
Wed, 20 May 2020 15:58   |
 |
mirek
Messages: 14291 Registered: November 2005
|
Ultimate Member |
|
|
Tom1 wrote on Wed, 20 May 2020 12:41
- 3T3 is faster on MSBT19 with short transfers up to 50-60 dwords.
Interestingly, adding "inline" to it seems to fix the problem... For some reason, 32-bit MSC does not inline it unless you ask it to do so...
In fact, assembler for both inlined function is virtually the same, the only difference is different tail handling (IMO, mine is 1% esier on eye):
7a:
0017940D cmp ecx,byte +0x4
00179410 jnl 0x17942f
00179412 test cl,0x2
00179415 jz 0x17941f
00179417 mov [eax+0x4],edi
0017941A mov [eax],edi
0017941C add eax,byte +0x8
0017941F test cl,0x1
00179422 jz dword 0x1794b4
00179428 mov [eax],edi
0017942A jmp dword 0x1794b4
0017942F movd xmm0,edi
00179433 pshufd xmm0,xmm0,0x0
00179438 movups [eax+ecx*4-0x10],xmm0 <<= tail handling
0017943D cmp ecx,byte +0x20
00179440 jl 0x179486
00179442 cmp ecx,0x100000
00179448 jl 0x179457
0017944A push ecx
0017944B push edi
0017944C push eax
0017944D call dword 0x14ff88
00179452 add esp,byte +0xc
00179455 jmp short 0x1794b4
00179457 lea edx,[ecx-0x20]
0017945A lea edx,[eax+edx*4]
0017945D nop dword [eax]
00179460 movups [eax],xmm0
00179463 movups [eax+0x10],xmm0
00179467 movups [eax+0x20],xmm0
0017946B movups [eax+0x30],xmm0
0017946F movups [eax+0x40],xmm0
00179473 movups [eax+0x50],xmm0
00179477 movups [eax+0x60],xmm0
0017947B movups [eax+0x70],xmm0
0017947F sub eax,byte -0x80
00179482 cmp eax,edx
00179484 jna 0x179460
00179486 test cl,0x10
00179489 jz 0x17949d
0017948B movups [eax],xmm0
0017948E movups [eax+0x10],xmm0
00179492 movups [eax+0x20],xmm0
00179496 movups [eax+0x30],xmm0
0017949A add eax,byte +0x40
0017949D test cl,0x8
001794A0 jz 0x1794ac
001794A2 movups [eax],xmm0
001794A5 movups [eax+0x10],xmm0
001794A9 add eax,byte +0x20
001794AC test cl,0x4
001794AF jz 0x1794b4
001794B1 movups [eax],xmm0
3T3
00179540 cmp eax,byte +0x4
00179543 jnl 0x179560
00179545 test al,0x1
00179547 jz 0x17954e
00179549 mov [edx],edi
0017954B add edx,byte +0x4
0017954E test al,0x2
00179550 jz dword 0x179607
00179556 mov [edx],edi
00179558 mov [edx+0x4],edi
0017955B jmp dword 0x179607
00179560 movd xmm0,edi
00179564 mov ecx,edx
00179566 pshufd xmm0,xmm0,0x0
0017956B cmp eax,byte +0x20
0017956E jl 0x1795c6
00179570 cmp eax,0x100000
00179575 jng 0x179589
00179577 test dl,0x3
0017957A jnz 0x179589
0017957C push eax
0017957D push edi
0017957E push edx
0017957F call dword 0x14ff88
00179584 add esp,byte +0xc
00179587 jmp short 0x179604
00179589 mov edi,eax
0017958B sar edi,0x2
0017958E sub edi,byte +0x7
00179591 shl edi,0x4
00179594 add edi,edx
00179596 mov eax,ecx
00179598 movups [eax],xmm0
0017959B lea eax,[ecx+0x70]
0017959E movups [ecx+0x10],xmm0
001795A2 movups [ecx+0x20],xmm0
001795A6 movups [ecx+0x30],xmm0
001795AA movups [ecx+0x40],xmm0
001795AE movups [ecx+0x50],xmm0
001795B2 movups [ecx+0x60],xmm0
001795B6 sub ecx,byte -0x80
001795B9 movups [eax],xmm0
001795BC cmp ecx,edi
001795BE jc 0x179596
001795C0 mov eax,[ebp-0x14]
001795C3 mov edi,[ebp-0x18]
001795C6 test al,0x10
001795C8 jz 0x1795e3
001795CA mov eax,ecx
001795CC movups [eax],xmm0
001795CF lea eax,[ecx+0x30]
001795D2 movups [ecx+0x10],xmm0
001795D6 movups [ecx+0x20],xmm0
001795DA add ecx,byte +0x40
001795DD movups [eax],xmm0
001795E0 mov eax,[ebp-0x14]
001795E3 test al,0x8
001795E5 jz 0x1795f8
001795E7 mov eax,ecx
001795E9 movups [eax],xmm0
001795EC lea eax,[ecx+0x10]
001795EF add ecx,byte +0x20
001795F2 movups [eax],xmm0
001795F5 mov eax,[ebp-0x14]
001795F8 test al,0x4
001795FA jz 0x1795ff
001795FC movups [ecx],xmm0
001795FF movups [edx+eax*4-0x10],xmm0 <= TAIL
EDIT: OK, now rechecking it, it looks like 3T3 has a bit more instructions doing weird things....
[Updated on: Wed, 20 May 2020 16:01] Report message to a moderator
|
|
|
|
|
|
|
|
| Re: BufferPainter::Clear() optimization [message #54007 is a reply to message #54006] |
Wed, 20 May 2020 17:31   |
 |
mirek
Messages: 14291 Registered: November 2005
|
Ultimate Member |
|
|
I am getting quite different picture:
int bsize=8*1024*1024;
Buffer<dword> b(bsize, 0);
dword cw = 123;
String result="\"N\",\"memsetd()\",\"Fill3T3()\"\r\n";
for(int len=1;len<=bsize;){
int maximum=100000000/len;
int64 t0=usecs();
for(int i = 0; i < maximum; i++)
memsetd(~b, cw, len);
int64 t1=usecs();
for(int i = 0; i < maximum; i++)
Fill3T3(~b, cw, len);
int64 t2=usecs();
String r = Format("%d,%f,%f",len,1000.0*(t1-t0)/maximum,1000.0*(t2-t1)/maximum);
RLOG(r);
result.Cat(r + "\r\n");
if(len<64) len++;
else len*=2;
}
SaveFile(GetHomeDirFile("memset.csv"),result);
I am starting to wonder if there is difference between our MSC 32bit compilers...
-
Attachment: memset.csv
(Size: 1.90KB, Downloaded 301 times)
|
|
|
|
|
|
| Re: BufferPainter::Clear() optimization [message #54010 is a reply to message #54007] |
Wed, 20 May 2020 19:51   |
Tom1
Messages: 1319 Registered: March 2007
|
Ultimate Contributor |
|
|
mirek wrote on Wed, 20 May 2020 18:31I am getting quite different picture:
int bsize=8*1024*1024;
Buffer<dword> b(bsize, 0);
dword cw = 123;
String result="\"N\",\"memsetd()\",\"Fill3T3()\"\r\n";
for(int len=1;len<=bsize;){
int maximum=100000000/len;
int64 t0=usecs();
for(int i = 0; i < maximum; i++)
memsetd(~b, cw, len);
int64 t1=usecs();
for(int i = 0; i < maximum; i++)
Fill3T3(~b, cw, len);
int64 t2=usecs();
String r = Format("%d,%f,%f",len,1000.0*(t1-t0)/maximum,1000.0*(t2-t1)/maximum);
RLOG(r);
result.Cat(r + "\r\n");
if(len<64) len++;
else len*=2;
}
SaveFile(GetHomeDirFile("memset.csv"),result);
I am starting to wonder if there is difference between our MSC 32bit compilers...
Hi,
No wonder we ended up with (very slightly) different approach... Your results are more or less reversed to what I'm getting. I tried to reorder the calls too, but without any observable difference.
It's either the different CPUs or a different compiler. My compiler is:
Microsoft (R) C/C++ Optimizing Compiler Version 19.21.27702.2 for x86
Copyright (C) Microsoft Corporation. All rights reserved.
Should I downgrade or upgrade?...
Anyway, seriously I'm pleased with the final result here. The filler is now better than anything before and can be used generally for all clearing/presetting of buffers. I use this a lot in signal processing in addition to clearing the ImageBuffer for BufferPainter. After all, the ImageBuffer needs to be cleared or preset to user preference background color once before each display update. It is much better to have a 1.5 ms delay instead of 3.6 ms delay before drawing approximately 10-20 ms worth of vector map data on the screen. 
Should this new memsetd() now be deployed all over the u++? I mean e.g. Core/Topt.h :: Fill?
Thank you a lot for your great work on this! 
Best regards,
Tom
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| Re: BufferPainter::Clear() optimization [message #54023 is a reply to message #54022] |
Thu, 21 May 2020 19:25   |
Tom1
Messages: 1319 Registered: March 2007
|
Ultimate Contributor |
|
|
Mirek,
I just found that there is a sweet spot at ~0x3f alignment (i.e. 64 bytes) on my CPU. This is presumably the L1 cache line length, if I'm not mistaken.
Best regards,
Tom
EDIT: It just looks that I cannot squeeze the benefit out as re-alignment code tends to eat what would could possibly be achieved here. However, if allocator could allocate large blocks at even 64 byte limits, that could improve performance behind the scenes.
[Updated on: Thu, 21 May 2020 23:46] Report message to a moderator
|
|
|
|
|
|
Goto Forum:
Current Time: Wed May 13 01:34:25 GMT+2 2026
Total time taken to generate the page: 0.01356 seconds
|