Re: BufferPainter::Clear() optimization [message #53997 is a reply to message #53996] Wed, 20 May 2020 12:23
mirek
Messages: 13975
Registered: November 2005
Ultimate Member
OK, after retesting, I think it might be at most 3% faster. Looking at the fillers, I think much more time is spent in the AlphaBlend function, even if it is only called for the segment start/end pixels. Perhaps that one should be SSE2-optimized? :)

Mirek
Re: BufferPainter::Clear() optimization [message #53998 is a reply to message #53997] Wed, 20 May 2020 12:41
Tom1
Messages: 1212
Registered: March 2007
Senior Contributor
Hi Mirek,

Two things to consider before you go with 7a:

- 7a crashes on unaligned buffers (t&3) while 3T3 handles them all.
- 3T3 is faster on MSBT19 with short transfers up to 50-60 dwords.

Best regards,

Tom
Re: BufferPainter::Clear() optimization [message #53999 is a reply to message #53997] Wed, 20 May 2020 12:52
Tom1
mirek wrote on Wed, 20 May 2020 13:23
OK, after retesting, I think it might be at most 3% faster. Looking at the fillers, I think much more time is spent in the AlphaBlend function, even if it is only called for the segment start/end pixels. Perhaps that one should be SSE2-optimized? :)

Mirek


Hi,

My SSE2 battery is now 'discharged' for a while... It needs a recharge before the next use. :)

I also did some testing on the span filler with memcpy. It relies on the image being rendered having kind IMAGE_OPAQUE. It does improve the speed somewhat, but the edges are a problem: an edge is alpha blended even if FILL_FAST is specified. So this needs some reconsideration and better knowledge of the Painter internals (i.e. beyond my level...):

BufferPainter.h:

struct SpanSource {
	int kind;
	SpanSource(){
		kind = IMAGE_OPAQUE;
	}
	virtual void Get(RGBA *span, int x, int y, unsigned len) = 0;
	virtual ~SpanSource() {}
};

Fillers.cpp:

void SpanFiller::Render(int val, int len)
{
	if(val == 0) {
		t += len;
		s += len;
		return;
	}
	if(alpha != 256)
		val = alpha * val >> 8;

	if(val == 256) {
		if(ss->kind==IMAGE_OPAQUE) memcpy(t,s,len*sizeof(RGBA)); // apex_memcpy() would be even faster
		else{
			for(int i = 0; i < len; i++) {
				if(s[i].a == 255)
					t[i] = s[i];
				else
					AlphaBlend(t[i], s[i]);
			}
		}
		t += len;
		s += len;
	}
	else {
		const RGBA *e = t + len;
		while(t < e)
			AlphaBlendCover8(*t++, *s++, val);
	}
}

Painter/Image.cpp:


struct PainterImageSpan : SpanSource, PainterImageSpanData {
	LinearInterpolator interpolator;

	PainterImageSpan(const PainterImageSpanData& f)
	:	PainterImageSpanData(f) {
		interpolator.Set(xform);
		kind = image.GetKindNoScan(); // Add this
	}


Best regards,

Tom
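
Note: the fast path described in this post can be tried in isolation. The following is a minimal, self-contained sketch, not the Painter code: Pixel, BlendOver and RenderSpan are hypothetical stand-ins for RGBA, AlphaBlend and SpanFiller::Render, and the blend formula assumes premultiplied alpha.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for U++'s RGBA (premultiplied alpha is assumed).
struct Pixel { std::uint8_t b, g, r, a; };

// Simplified "source over" blend for premultiplied pixels; a stand-in for AlphaBlend.
inline void BlendOver(Pixel& t, const Pixel& s)
{
	const unsigned inv = 255u - s.a; // share of the target that stays visible
	t.b = static_cast<std::uint8_t>(s.b + t.b * inv / 255u);
	t.g = static_cast<std::uint8_t>(s.g + t.g * inv / 255u);
	t.r = static_cast<std::uint8_t>(s.r + t.r * inv / 255u);
	t.a = static_cast<std::uint8_t>(s.a + t.a * inv / 255u);
}

// Renders a fully covered span. When the whole source is known to be opaque,
// a single memcpy replaces the per-pixel alpha test.
inline void RenderSpan(Pixel *t, const Pixel *s, std::size_t len, bool source_opaque)
{
	assert(t != nullptr && s != nullptr);
	if(source_opaque) {
		std::memcpy(t, s, len * sizeof(Pixel));
		return;
	}
	for(std::size_t i = 0; i < len; i++) {
		if(s[i].a == 255)
			t[i] = s[i];           // opaque pixel: plain copy
		else
			BlendOver(t[i], s[i]); // translucent pixel: blend
	}
}
```

The sketch only covers full coverage (val == 256 in the listing above). The edge pixels of a span arrive with partial coverage and still go through the blending branch, which is the limitation the post points out.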
Re: BufferPainter::Clear() optimization [message #54000 is a reply to message #53998] Wed, 20 May 2020 12:53
mirek
I was aware of the unaligned problem; that's fixed in the final version. That said, unaligned access in general should be considered illegal anyway, because otherwise all hell will break loose with ARMv6....
Re: BufferPainter::Clear() optimization [message #54002 is a reply to message #54000] Wed, 20 May 2020 13:01
Tom1
Quote:
I was aware of the unaligned problem; that's fixed in the final version. That said, unaligned access in general should be considered illegal anyway, because otherwise all hell will break loose with ARMv6....


But that's good to know. In this case we could drop the (t&3) code entirely from 3T3 and improve instruction cache locality for even better results on short transfers.

((Is there a way to 'cleanly crash' (whatever that might mean) an application attempting an unaligned memset? Now it just disappears from the process list, at least on Windows.))

EDIT: Let me rephrase it: Is there a way to check during development that an application will never use an unaligned memset?

Best regards,

Tom

[Updated on: Wed, 20 May 2020 13:09]

Re: BufferPainter::Clear() optimization [message #54003 is a reply to message #54002] Wed, 20 May 2020 15:18
mirek
Tom1 wrote on Wed, 20 May 2020 13:01

EDIT: Let me rephrase it: Is there a way to check during development that an application will never use unaligned memset?


memsetd!

Yes, put ASSERT(((uintptr_t)t & 3) == 0); into memsetd :)

Mirek
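
Note: the suggested check can be tried outside U++. The following is a minimal sketch that uses the standard assert in place of U++'s ASSERT; IsDwordAligned and CheckedFill are hypothetical names, and the scalar loop merely stands in for the real memsetd body.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True when p sits on a 4-byte (dword) boundary.
inline bool IsDwordAligned(const void *p)
{
	return (reinterpret_cast<std::uintptr_t>(p) & 3) == 0;
}

// Hypothetical checked fill: a plain scalar loop guarded by the alignment assertion.
inline void CheckedFill(void *p, std::uint32_t data, std::size_t len)
{
	assert(IsDwordAligned(p)); // active in debug builds, compiled out when NDEBUG is defined
	std::uint32_t *t = static_cast<std::uint32_t *>(p);
	for(std::size_t i = 0; i < len; i++)
		t[i] = data;
}
```

Because the assertion is compiled out of release builds, it answers the rephrased question (catching misuse during development) rather than the original one about crashing cleanly in production.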
Re: BufferPainter::Clear() optimization [message #54004 is a reply to message #53998] Wed, 20 May 2020 15:58
mirek
Tom1 wrote on Wed, 20 May 2020 12:41

- 3T3 is faster on MSBT19 with short transfers up to 50-60 dwords.


Interestingly, adding "inline" to it seems to fix the problem... :) For some reason, 32-bit MSC does not inline it unless you ask it to...

In fact, the assembler for both inlined functions is virtually the same; the only difference is the tail handling (IMO, mine is 1% easier on the eye):

7a:

0017940D  cmp ecx,byte +0x4 
00179410  jnl 0x17942f 
00179412  test cl,0x2 
00179415  jz 0x17941f 
00179417  mov [eax+0x4],edi 
0017941A  mov [eax],edi 
0017941C  add eax,byte +0x8 
0017941F  test cl,0x1 
00179422  jz dword 0x1794b4 
00179428  mov [eax],edi 
0017942A  jmp dword 0x1794b4 
0017942F  movd xmm0,edi 
00179433  pshufd xmm0,xmm0,0x0 
00179438  movups [eax+ecx*4-0x10],xmm0     <<= tail handling
0017943D  cmp ecx,byte +0x20 
00179440  jl 0x179486 
00179442  cmp ecx,0x100000 
00179448  jl 0x179457 
0017944A  push ecx 
0017944B  push edi 
0017944C  push eax 
0017944D  call dword 0x14ff88 
00179452  add esp,byte +0xc 
00179455  jmp short 0x1794b4 
00179457  lea edx,[ecx-0x20] 
0017945A  lea edx,[eax+edx*4] 
0017945D  nop dword [eax] 
00179460  movups [eax],xmm0 
00179463  movups [eax+0x10],xmm0 
00179467  movups [eax+0x20],xmm0 
0017946B  movups [eax+0x30],xmm0 
0017946F  movups [eax+0x40],xmm0 
00179473  movups [eax+0x50],xmm0 
00179477  movups [eax+0x60],xmm0 
0017947B  movups [eax+0x70],xmm0 
0017947F  sub eax,byte -0x80 
00179482  cmp eax,edx 
00179484  jna 0x179460 
00179486  test cl,0x10 
00179489  jz 0x17949d 
0017948B  movups [eax],xmm0 
0017948E  movups [eax+0x10],xmm0 
00179492  movups [eax+0x20],xmm0 
00179496  movups [eax+0x30],xmm0 
0017949A  add eax,byte +0x40 
0017949D  test cl,0x8 
001794A0  jz 0x1794ac 
001794A2  movups [eax],xmm0 
001794A5  movups [eax+0x10],xmm0 
001794A9  add eax,byte +0x20 
001794AC  test cl,0x4 
001794AF  jz 0x1794b4 
001794B1  movups [eax],xmm0 


3T3:

00179540  cmp eax,byte +0x4 
00179543  jnl 0x179560 
00179545  test al,0x1 
00179547  jz 0x17954e 
00179549  mov [edx],edi 
0017954B  add edx,byte +0x4 
0017954E  test al,0x2 
00179550  jz dword 0x179607 
00179556  mov [edx],edi 
00179558  mov [edx+0x4],edi 
0017955B  jmp dword 0x179607 
00179560  movd xmm0,edi 
00179564  mov ecx,edx 
00179566  pshufd xmm0,xmm0,0x0 
0017956B  cmp eax,byte +0x20 
0017956E  jl 0x1795c6 
00179570  cmp eax,0x100000 
00179575  jng 0x179589 
00179577  test dl,0x3 
0017957A  jnz 0x179589 
0017957C  push eax 
0017957D  push edi 
0017957E  push edx 
0017957F  call dword 0x14ff88 
00179584  add esp,byte +0xc 
00179587  jmp short 0x179604 
00179589  mov edi,eax 
0017958B  sar edi,0x2 
0017958E  sub edi,byte +0x7 
00179591  shl edi,0x4 
00179594  add edi,edx 
00179596  mov eax,ecx 
00179598  movups [eax],xmm0 
0017959B  lea eax,[ecx+0x70] 
0017959E  movups [ecx+0x10],xmm0 
001795A2  movups [ecx+0x20],xmm0 
001795A6  movups [ecx+0x30],xmm0 
001795AA  movups [ecx+0x40],xmm0 
001795AE  movups [ecx+0x50],xmm0 
001795B2  movups [ecx+0x60],xmm0 
001795B6  sub ecx,byte -0x80 
001795B9  movups [eax],xmm0 
001795BC  cmp ecx,edi 
001795BE  jc 0x179596 
001795C0  mov eax,[ebp-0x14] 
001795C3  mov edi,[ebp-0x18] 
001795C6  test al,0x10 
001795C8  jz 0x1795e3 
001795CA  mov eax,ecx 
001795CC  movups [eax],xmm0 
001795CF  lea eax,[ecx+0x30] 
001795D2  movups [ecx+0x10],xmm0 
001795D6  movups [ecx+0x20],xmm0 
001795DA  add ecx,byte +0x40 
001795DD  movups [eax],xmm0 
001795E0  mov eax,[ebp-0x14] 
001795E3  test al,0x8 
001795E5  jz 0x1795f8 
001795E7  mov eax,ecx 
001795E9  movups [eax],xmm0 
001795EC  lea eax,[ecx+0x10] 
001795EF  add ecx,byte +0x20 
001795F2  movups [eax],xmm0 
001795F5  mov eax,[ebp-0x14] 
001795F8  test al,0x4 
001795FA  jz 0x1795ff 
001795FC  movups [ecx],xmm0 
001795FF  movups [edx+eax*4-0x10],xmm0     <= TAIL


EDIT: OK, now rechecking it, it looks like 3T3 has a few more instructions doing weird things....

[Updated on: Wed, 20 May 2020 16:01]

Re: BufferPainter::Clear() optimization [message #54005 is a reply to message #54004] Wed, 20 May 2020 16:15
Tom1
Hi,

This is strange, since I added inline to 7a as soon as I started testing it. (I had found out earlier that MSBT19 did not inline it for me.) I have now done a new run, and the result is in the attached csv.

Can you post the latest 7a if it differs from the one posted above?

Best regards,

Tom
  • Attachment: memset.csv
    (Size: 1.95KB, Downloaded 135 times)
Re: BufferPainter::Clear() optimization [message #54006 is a reply to message #54005] Wed, 20 May 2020 17:16
mirek
Tom1 wrote on Wed, 20 May 2020 16:15
Hi,

This is strange, since I immediately added the inline to 7a when I started testing it. (I found out earlier that MSBT19 did not do it for me.) Now I did a new run and the result is in the attached csv.

Can you post the latest 7a if it is any different compared to the one posted here above?

Best regards,

Tom


It is now in trunk as memsetd....

Mirek
Re: BufferPainter::Clear() optimization [message #54007 is a reply to message #54006] Wed, 20 May 2020 17:31
mirek
I am getting quite a different picture:

	int bsize=8*1024*1024;
	Buffer<dword> b(bsize, 0);
	dword cw = 123;

	String result="\"N\",\"memsetd()\",\"Fill3T3()\"\r\n";
	for(int len=1;len<=bsize;){
		int maximum=100000000/len;
		int64 t0=usecs();
		for(int i = 0; i < maximum; i++)
			memsetd(~b, cw, len);
		int64 t1=usecs();
		for(int i = 0; i < maximum; i++)
			Fill3T3(~b, cw, len);
		int64 t2=usecs();
		String r = Format("%d,%f,%f",len,1000.0*(t1-t0)/maximum,1000.0*(t2-t1)/maximum);
		RLOG(r);
		result.Cat(r + "\r\n");
		if(len<64) len++;
		else len*=2;
	}
	
	SaveFile(GetHomeDirFile("memset.csv"),result);


I am starting to wonder if there is a difference between our 32-bit MSC compilers...
  • Attachment: memset.csv
    (Size: 1.90KB, Downloaded 132 times)
Re: BufferPainter::Clear() optimization [message #54008 is a reply to message #54007] Wed, 20 May 2020 17:37
mirek
Ha, funny. It depends on the order in which the functions are tested. If I test memsetd second, I get different numbers :)
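
Note: one common way to reduce this kind of order dependence is to warm up first, alternate which routine runs first, and keep the best time of several rounds. The following is a minimal sketch using std::chrono rather than the U++ usecs() call from the benchmark above; FillA and FillB are hypothetical placeholders for the two routines being compared.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

using FillFn = void (*)(std::uint32_t *, std::uint32_t, std::size_t);

// Hypothetical candidates; substitute the two fill routines under test.
inline void FillA(std::uint32_t *t, std::uint32_t v, std::size_t n) { std::fill_n(t, n, v); }
inline void FillB(std::uint32_t *t, std::uint32_t v, std::size_t n) { for(std::size_t i = 0; i < n; i++) t[i] = v; }

// Nanoseconds per call for `calls` back-to-back fills of the same buffer.
inline double TimeFill(FillFn fn, std::vector<std::uint32_t>& buf, int calls)
{
	assert(calls > 0);
	const auto t0 = std::chrono::steady_clock::now();
	for(int i = 0; i < calls; i++)
		fn(buf.data(), static_cast<std::uint32_t>(i), buf.size());
	const auto t1 = std::chrono::steady_clock::now();
	return std::chrono::duration<double, std::nano>(t1 - t0).count() / calls;
}

struct Timing { double a_ns, b_ns; };

// Alternates the order of the two candidates and keeps the best time of each.
inline Timing Compare(FillFn a, FillFn b, std::size_t len, int calls, int rounds)
{
	std::vector<std::uint32_t> buf(len);
	TimeFill(a, buf, 1); // warm-up: touch the pages and the code of both routines
	TimeFill(b, buf, 1);
	const double inf = std::numeric_limits<double>::max();
	Timing best = { inf, inf };
	for(int r = 0; r < rounds; r++) {
		if(r & 1) { // odd rounds: B runs first
			best.b_ns = std::min(best.b_ns, TimeFill(b, buf, calls));
			best.a_ns = std::min(best.a_ns, TimeFill(a, buf, calls));
		}
		else {      // even rounds: A runs first
			best.a_ns = std::min(best.a_ns, TimeFill(a, buf, calls));
			best.b_ns = std::min(best.b_ns, TimeFill(b, buf, calls));
		}
	}
	return best;
}
```

A call such as Compare(FillA, FillB, 256, 100000, 8) returns the best per-call time for each routine. This does not explain the effect seen in the thread; it only makes a single run less sensitive to it. A real harness should also read the buffer afterwards so that the optimizer cannot discard the stores.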
Re: BufferPainter::Clear() optimization [message #54010 is a reply to message #54007] Wed, 20 May 2020 19:51
Tom1
mirek wrote on Wed, 20 May 2020 18:31
I am getting quite different picture:

	int bsize=8*1024*1024;
	Buffer<dword> b(bsize, 0);
	dword cw = 123;

	String result="\"N\",\"memsetd()\",\"Fill3T3()\"\r\n";
	for(int len=1;len<=bsize;){
		int maximum=100000000/len;
		int64 t0=usecs();
		for(int i = 0; i < maximum; i++)
			memsetd(~b, cw, len);
		int64 t1=usecs();
		for(int i = 0; i < maximum; i++)
			Fill3T3(~b, cw, len);
		int64 t2=usecs();
		String r = Format("%d,%f,%f",len,1000.0*(t1-t0)/maximum,1000.0*(t2-t1)/maximum);
		RLOG(r);
		result.Cat(r + "\r\n");
		if(len<64) len++;
		else len*=2;
	}
	
	SaveFile(GetHomeDirFile("memset.csv"),result);


I am starting to wonder if there is difference between our MSC 32bit compilers...


Hi,

No wonder we ended up with (very slightly) different approaches... Your results are more or less the reverse of what I'm getting. I tried reordering the calls too, but without any observable difference.

It's either the different CPUs or a different compiler. My compiler is:

Microsoft (R) C/C++ Optimizing Compiler Version 19.21.27702.2 for x86
Copyright (C) Microsoft Corporation.  All rights reserved.


Should I downgrade or upgrade?...

Anyway, seriously, I'm pleased with the final result here. The filler is now better than anything before and can be used generally for all clearing/presetting of buffers. I use this a lot in signal processing, in addition to clearing the ImageBuffer for BufferPainter. After all, the ImageBuffer needs to be cleared or preset to the user's preferred background color once before each display update. It is much better to have a 1.5 ms delay instead of a 3.6 ms delay before drawing approximately 10-20 ms worth of vector map data on the screen. :)

Should this new memsetd() now be deployed all over U++? I mean e.g. Core/Topt.h :: Fill?

Thanks a lot for your great work on this! :)

Best regards,

Tom
Re: BufferPainter::Clear() optimization [message #54011 is a reply to message #54010] Thu, 21 May 2020 09:04
mirek
Tom1 wrote on Wed, 20 May 2020 19:51

Should this new memsetd() now be deployed all over the u++? I mean e.g. Core/Topt.h :: Fill?


IDK, maybe as a specialisation...

Mirek
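
Note: one way such a specialisation could look is a non-template overload for 32-bit values placed next to a generic range fill. The following is a sketch only; FillRange and FastFill32 are hypothetical names, the generic version merely assumes a [dst, lim) signature similar to the Fill in Core/Topt.h, and the scalar loop in FastFill32 stands in for memsetd.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Generic element-by-element fill over the range [dst, lim).
template <class T>
inline void FillRange(T *dst, const T *lim, const T& x)
{
	while(dst < lim)
		*dst++ = x;
}

// Stand-in for the optimized dword fill discussed in this thread.
inline void FastFill32(std::uint32_t *t, std::uint32_t data, std::size_t len)
{
	for(std::size_t i = 0; i < len; i++)
		t[i] = data;
}

// Non-template overload: overload resolution prefers it over the template
// whenever the arguments are 32-bit unsigned values.
inline void FillRange(std::uint32_t *dst, const std::uint32_t *lim, const std::uint32_t& x)
{
	assert(dst <= lim);
	FastFill32(dst, x, static_cast<std::size_t>(lim - dst));
}
```

With this arrangement existing callers keep their call syntax: a fill over a std::uint32_t buffer is routed to the fast routine, while every other element type still uses the generic loop.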
Re: BufferPainter::Clear() optimization [message #54014 is a reply to message #54011] Thu, 21 May 2020 13:28
mirek
OK, so I could not stop digging and found the last important ingredient: alignment matters!

void FillX(void *p, dword data, int len)
{
	dword *t = (dword *)p;
	if(len < 4) {
		if(len & 2) {
			t[0] = t[1] = t[len - 1] = data;
			return;
		}
		if(len & 1)
			t[0] = data;
		return;
	}

	__m128i val4 = _mm_set1_epi32(data);
	auto Set4 = [&](int at) { _mm_storeu_si128((__m128i *)(t + at), val4); };

	Set4(len - 4); // fill tail
	if(len >= 16) {
		Set4(0); // align up on 16 bytes boundary
		const dword *e = t + len;
		t = (dword *)(((uintptr_t)t | 15) + 1);
		len = e - t;
		e -= 16;
		if(len >= 1024*1024) { // for really huge data, bypass the cache
			huge_memsetd(t, data, len);
			return;
		}
		while(t <= e) {
			Set4(0); Set4(4); Set4(8); Set4(12);
			t += 16;
		}
	}
	if(len & 8) {
		Set4(0); Set4(4);
		t += 8;
	}
	if(len & 4)
		Set4(0);
}


This is about twice as fast as Fill7a for len > 60 (up to the cache bypass limit).
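
Note: the two ideas in FillX (write the tail first with a store that may overlap, then continue from the next 16-byte boundary) can be checked with a portable sketch. The version below is not the posted code: it replaces _mm_storeu_si128 with a 16-byte memcpy, omits the unrolling and the huge-data path, and all names (Store4, FillSketch, CheckFill) are hypothetical. CheckFill compares the result against guard values for every length and dword offset.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Portable stand-in for an unaligned 16-byte store (_mm_storeu_si128 in FillX).
inline void Store4(std::uint32_t *t, std::uint32_t data)
{
	const std::uint32_t v[4] = { data, data, data, data };
	std::memcpy(t, v, sizeof(v));
}

inline void FillSketch(std::uint32_t *t, std::uint32_t data, std::size_t len)
{
	assert((reinterpret_cast<std::uintptr_t>(t) & 3) == 0); // dword-aligned input assumed
	if(len < 4) {
		for(std::size_t i = 0; i < len; i++)
			t[i] = data;
		return;
	}
	std::uint32_t *e = t + len;
	Store4(e - 4, data); // tail first; it may overlap the stores made below
	Store4(t, data);     // head; covers everything up to the next 16-byte boundary
	t = reinterpret_cast<std::uint32_t *>((reinterpret_cast<std::uintptr_t>(t) | 15) + 1);
	while(e - t >= 4) {  // whole 16-byte blocks, now at aligned addresses
		Store4(t, data);
		t += 4;
	}
	// Fewer than 4 dwords remain here, and the tail store has already written them.
}

// Fills at every dword offset and length up to max_len, then verifies that
// exactly the requested range was written and the guard values are intact.
inline bool CheckFill(std::size_t max_len)
{
	const std::uint32_t guard = 0xAAAAAAAAu, data = 0x12345678u;
	for(std::size_t offset = 0; offset < 4; offset++)
		for(std::size_t len = 0; len <= max_len; len++) {
			std::vector<std::uint32_t> buf(max_len + 16, guard);
			FillSketch(buf.data() + 4 + offset, data, len);
			for(std::size_t i = 0; i < buf.size(); i++) {
				const bool inside = i >= 4 + offset && i < 4 + offset + len;
				if(buf[i] != (inside ? data : guard))
					return false;
			}
		}
	return true;
}
```

The overlapping tail store is what removes the need for a scalar remainder loop: whatever the aligned loop leaves over always lies within the last four dwords, which were written at the start.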
Re: BufferPainter::Clear() optimization [message #54017 is a reply to message #54003] Thu, 21 May 2020 16:21
Tom1
mirek wrote on Wed, 20 May 2020 16:18
Tom1 wrote on Wed, 20 May 2020 13:01

EDIT: Let me rephrase it: Is there a way to check during development that an application will never use unaligned memset?


memsetd!

Yes, put ASSERT(((uintptr_t)t & 3) == 0); to memsetd Smile

Mirek


Good point! Please do!

Best regards,

Tom
Re: BufferPainter::Clear() optimization [message #54018 is a reply to message #54014] Thu, 21 May 2020 16:38
Tom1
Hi,

This new FillX is incredibly elegant! Congratulations, Mirek! I really like your new findings there. You just need to rename it to memsetd() and place it in the correct header in Core... :)

Best regards,

Tom
Re: BufferPainter::Clear() optimization [message #54021 is a reply to message #54018] Thu, 21 May 2020 17:51
koldo
Messages: 3355
Registered: August 2008
Senior Veteran
Thank you all for your work. However, please review this in Redmine.

Best regards
Iñaki
Re: BufferPainter::Clear() optimization [message #54022 is a reply to message #54021] Thu, 21 May 2020 19:22
Tom1
Hi Koldo,

I checked and #include <emmintrin.h> seems to work just fine for what we are working on. Thanks for pointing this out.

Mirek: Agree?

Best regards,

Tom
Re: BufferPainter::Clear() optimization [message #54023 is a reply to message #54022] Thu, 21 May 2020 19:25
Tom1
Mirek,

I just found that there is a sweet spot at ~0x3f alignment (i.e. 64 bytes) on my CPU. This is presumably the L1 cache line length, if I'm not mistaken.

Best regards,

Tom

EDIT: It looks like I cannot squeeze the benefit out, as the re-alignment code tends to eat whatever could be gained here. However, if the allocator could allocate large blocks on 64-byte boundaries, that could improve performance behind the scenes.
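
Note: the idea of handing out 64-byte aligned blocks can be sketched without touching the allocator, by over-allocating one cache line and aligning inside the block with std::align. AlignedDwords below is a hypothetical helper, not a U++ class, and the 64-byte line size is the assumption taken from this post.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical helper: reserves one spare cache line and exposes a pointer
// to `count` dwords that starts on a 64-byte boundary.
class AlignedDwords {
public:
	explicit AlignedDwords(std::size_t count)
	:	storage(count * sizeof(std::uint32_t) + kAlign), size(count)
	{
		void *p = storage.data();
		std::size_t space = storage.size();
		p = std::align(kAlign, count * sizeof(std::uint32_t), p, space);
		assert(p != nullptr); // cannot fail: kAlign spare bytes were reserved
		data = static_cast<std::uint32_t *>(p);
	}
	AlignedDwords(const AlignedDwords&) = delete;            // copying would leave
	AlignedDwords& operator=(const AlignedDwords&) = delete; // data pointing at the old block

	std::uint32_t *begin()       { return data; }
	std::size_t    count() const { return size; }

private:
	static constexpr std::size_t kAlign = 64; // assumed cache line size
	std::vector<unsigned char> storage;
	std::uint32_t *data;
	std::size_t size;
};
```

A fill routine receiving begin() from such a buffer starts on a cache line boundary, so no re-alignment code is needed inside the fill itself. The cost is up to 64 wasted bytes per block, which only matters for small allocations.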

[Updated on: Thu, 21 May 2020 23:46]

Re: BufferPainter::Clear() optimization [message #54026 is a reply to message #54023] Fri, 22 May 2020 09:32
Didier
Messages: 680
Registered: November 2008
Location: France
Contributor
Hello Mirek and Tom,
Great work here, but I have one simple question: what is the point of using the cache?
Normally the cache speeds things up when you need to re-access data just after writing it.
Filling a buffer with a constant value that, in most cases, is not read immediately afterwards does not match that use case.
So I think a fill function that does not use the cache at all would bring two benefits:
timing stability and, more importantly, the cache is left untouched, so it can speed up subsequent function calls.
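
Note: on x86, a fill that bypasses the cache is usually written with non-temporal (streaming) stores. The following is a minimal sketch of that technique using the SSE2 intrinsic _mm_stream_si128, with a scalar fallback on other targets. StreamFill is a hypothetical name; the thread does not show how huge_memsetd is implemented, so this is not a copy of it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h> // SSE2 intrinsics
#define STREAM_FILL_SSE2 1
#endif

// Fills len dwords at t with data, using streaming stores where available.
inline void StreamFill(std::uint32_t *t, std::uint32_t data, std::size_t len)
{
	assert((reinterpret_cast<std::uintptr_t>(t) & 3) == 0); // dword-aligned input assumed
	std::uint32_t *e = t + len;
#ifdef STREAM_FILL_SSE2
	// Streaming stores need 16-byte aligned addresses: scalar stores up to the boundary.
	while(t < e && (reinterpret_cast<std::uintptr_t>(t) & 15) != 0)
		*t++ = data;
	const __m128i v = _mm_set1_epi32(static_cast<int>(data));
	while(e - t >= 4) {
		_mm_stream_si128(reinterpret_cast<__m128i *>(t), v); // non-temporal store
		t += 4;
	}
	_mm_sfence(); // make the streamed data visible before any later stores
#endif
	while(t < e) // remaining dwords (everything, on non-SSE2 targets)
		*t++ = data;
}
```

Whether this is a win depends on what happens next: if the buffer is read or drawn into shortly after the fill, as with an ImageBuffer that is cleared and then painted, ordinary stores leave it in the cache and the following accesses are faster. That trade-off is presumably why FillX above only takes the bypass path from 1024*1024 dwords upwards.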