Home » Developing U++ » U++ Developers corner » SSE2 and SVO optimization (Painter, memcpy....)
Re: BufferPainter::Clear() optimization [message #54027 is a reply to message #54026] Fri, 22 May 2020 10:04
mirek
Messages: 12536
Registered: November 2005
Ultimate Member
Didier wrote on Fri, 22 May 2020 09:32
Hello Mirek and Tom,
Great work here, but I have one simple question: what is the point of using the cache?
Normally the cache speeds things up when you need to re-access data just after writing it.
So filling a buffer with a constant value that, in most cases, is not read immediately afterwards isn't a matching use case.
So I think that a fill function that doesn't use the cache at all would bring two benefits:
timing stability and, more importantly, the cache is left untouched, which can speed up subsequent function calls.


The thing that started this whole issue: if you need to clear the buffer for a 4K screen, that is about 32MB of data. That's definitely more than can fit into the cache. So what really happens in this case is that at some point the cache runs out and you are significantly slowed down by the CPU writing data from the cache to main memory. The "fix" is to bypass the cache in this case (for now we have established that a reasonable threshold is somewhere around 4MB).

That said, a lot of other things were optimised thereafter, mostly at the other end of the size spectrum...
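
To illustrate what "bypassing the cache" can look like, here is a minimal sketch using SSE2 non-temporal stores. It is not the U++ implementation: the names FillStreaming, Fill and kStreamThreshold are hypothetical, and the threshold is simply the roughly 4MB figure mentioned above expressed in 32-bit elements.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>

// Hypothetical threshold in 32-bit elements (about 4MB), taken from the figure quoted above.
const size_t kStreamThreshold = 1024 * 1024;

// Fill using non-temporal stores, which write around the cache.
void FillStreaming(uint32_t *t, uint32_t data, size_t len)
{
	assert(((uintptr_t)t & 3) == 0); // an element-aligned pointer is expected
	while(len && ((uintptr_t)t & 15)) { // scalar head up to a 16-byte boundary
		*t++ = data;
		len--;
	}
	__m128i v = _mm_set1_epi32((int)data);
	for(; len >= 4; len -= 4, t += 4)
		_mm_stream_si128((__m128i *)t, v); // requires a 16-byte aligned address
	_mm_sfence(); // order the streamed stores before later stores
	while(len) { // scalar tail, at most 3 elements
		*t++ = data;
		len--;
	}
}

// Ordinary stores for small buffers, streaming stores for huge ones.
void Fill(uint32_t *t, uint32_t data, size_t len)
{
	if(len >= kStreamThreshold) {
		FillStreaming(t, data, len);
		return;
	}
	for(size_t i = 0; i < len; i++)
		t[i] = data;
}
```

The right threshold depends on the cache sizes of the machine, so it is something to benchmark rather than hard-code blindly.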
Re: BufferPainter::Clear() optimization [message #54028 is a reply to message #54023] Fri, 22 May 2020 10:05
mirek
Messages: 12536
Registered: November 2005
Ultimate Member
Tom1 wrote on Thu, 21 May 2020 19:25

EDIT: It looks like I cannot squeeze the benefit out, as the re-alignment code tends to eat whatever could be gained here. However, if the allocator could allocate large blocks on 64-byte boundaries, that could improve performance behind the scenes.


It cannot, as alignment is an important part of the block information...

Mirek
Re: BufferPainter::Clear() optimization [message #54031 is a reply to message #54028] Fri, 22 May 2020 10:28
mirek
Messages: 12536
Registered: November 2005
Ultimate Member
So I have implemented a bunch of other functions based on info gathered during this session:

memcpyd
svo_memset
svo_memcpy

Now I have hopefully the last problem to tune... I tried putting svo_memcpy into the Vector::Add grow routine and it indeed improved performance a bit. Then I tried to improve this even more by using memcpyd (which svo_memcpy uses as its backend in some situations) and performance dropped.

I believe the problem is that memcpyd became too fat and it screws up inlining. So the thing to solve now is to find how to move some of this fat into a non-inline function.... (svo_memcpy already has such a non-inlined part). Probably the same should happen to memsetd too....
Re: BufferPainter::Clear() optimization [message #54032 is a reply to message #54028] Fri, 22 May 2020 10:29
koldo
Messages: 3129
Registered: August 2008
Ultimate Member
One question. To use these new features, is it necessary to set compiler flags, like /arch:AVX in Visual Studio?

Best regards
Iñaki
Re: BufferPainter::Clear() optimization [message #54033 is a reply to message #54031] Fri, 22 May 2020 11:13
Tom1
Messages: 794
Registered: March 2007
Contributor
Quote:
I believe the problem is that memcpyd became too fat and it screws up inlining. So the thing to solve now is to find how to move some of this fat into a non-inline function.... (svo_memcpy already has such a non-inlined part). Probably the same should happen to memsetd too....


Hi Mirek,

I think this could be the same phenomenon that caused me issues with 32-bit MSC. It was more sensitive to code length, and the short transfers suffered immediately when code size increased. At the same time MSBT19x64 and both CLANG and CLANGx64 did not experience any trouble. Perhaps MSBT19 did not do as good a job with code size as the rest, and on my CPU the instruction cache was exhausted. I bet the instruction cache on your CPU is larger than what my i7 has.

At some point I was thinking of offering the functions as two variants, inline and never_inline, where the never_inline one simply calls the inline one. Then, where the code benefits from it, the never_inline variant would be called.

Then I also thought of handling something like <= 16 .. 32 sizes inline and the rest in a deeper never_inline function. This would probably improve the situation without adding so much complexity.
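
The second idea can be sketched roughly as follows. This is only an illustration under stated assumptions: CopySmallInline, CopyLarge and DEMO_NOINLINE are hypothetical names (U++ has its own never_inline macro), the cut-off of 16 bytes is arbitrary, and the slow path simply defers to std::memcpy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

#if defined(_MSC_VER)
#define DEMO_NOINLINE __declspec(noinline)
#else
#define DEMO_NOINLINE __attribute__((noinline))
#endif

// Out-of-line slow path: the bulky code exists once in the binary.
DEMO_NOINLINE void CopyLarge(void *t, const void *s, size_t len)
{
	std::memcpy(t, s, len);
}

// Inline fast path: only a compare and a short byte loop are duplicated at each call site.
inline void CopySmallInline(void *t, const void *s, size_t len)
{
	assert(t && s);
	if(len > 16) { // hypothetical cut-off; the thread debates 16 versus 32
		CopyLarge(t, s, len);
		return;
	}
	unsigned char *d = (unsigned char *)t;
	const unsigned char *q = (const unsigned char *)s;
	for(size_t i = 0; i < len; i++)
		d[i] = q[i];
}
```

The smaller the inline part, the less it costs in code size at every call site, which is the instruction-cache concern raised above.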

Best regards,

Tom

Re: BufferPainter::Clear() optimization [message #54034 is a reply to message #54032] Fri, 22 May 2020 11:32
mirek
Messages: 12536
Registered: November 2005
Ultimate Member
koldo wrote on Fri, 22 May 2020 10:29
One question. To use these new features, is it necessary to set compiler flags, like /arch:AVX in Visual Studio?


Not so far. This is just SSE2, which has been enabled by default for ages now...

Of course, the next logical step is to use AVX256 :)

Mirek

[Updated on: Fri, 22 May 2020 11:34]


Re: BufferPainter::Clear() optimization [message #54035 is a reply to message #54032] Fri, 22 May 2020 11:32
Tom1
Messages: 794
Registered: March 2007
Contributor
koldo wrote on Fri, 22 May 2020 11:29
One question. To use these new features, is it necessary to set compiler flags, like /arch:AVX in Visual Studio?


Hi Koldo,

Here I do not need /arch:AVX or any other compiler flag. It's just the include (#include <smmintrin.h> or #include <emmintrin.h>) that is needed, which works for me, I think.

Best regards,

Tom

EDIT: Mirek was faster to respond! :)

[Updated on: Fri, 22 May 2020 11:34]


Re: BufferPainter::Clear() optimization [message #54036 is a reply to message #54033] Fri, 22 May 2020 11:39
mirek
Messages: 12536
Registered: November 2005
Ultimate Member
Tom1 wrote on Fri, 22 May 2020 11:13

Then I also thought of handling something like <= 16 .. 32 sizes inline and the rest in a deeper never_inline function. This would probably improve the situation without adding so much complexity.


In the trunk now... sizes >=16 are now handled by a non-inline function. There is an impact in your benchmark (the one that runs for all sizes), less impact in my benchmark (with random sizes), but I think this is the right move...

Another benefit is that we can now consider using AVX (testing for AVX presence would be clumsy in an inline function, I think).

Mirek
Re: BufferPainter::Clear() optimization [message #54038 is a reply to message #54036] Fri, 22 May 2020 11:46
Tom1
Messages: 794
Registered: March 2007
Contributor
mirek wrote on Fri, 22 May 2020 12:39
Tom1 wrote on Fri, 22 May 2020 11:13

Then I also thought of handling something like <= 16 .. 32 sizes inline and the rest in a deeper never_inline function. This would probably improve the situation without adding so much complexity.


In the trunk now... sizes >=16 are now handled by a non-inline function. There is an impact in your benchmark (the one that runs for all sizes), less impact in my benchmark (with random sizes), but I think this is the right move...

Another benefit is that we can now consider using AVX (testing for AVX presence would be clumsy in an inline function, I think).

Mirek


apex_memmove() did the architecture check at startup (or on first run) and then initialized function pointers to the optimal versions. I think we could do this too in some INITBLOCK.
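
A rough sketch of that kind of one-time dispatch is below. It is not apex_memmove's code and does not use INITBLOCK; the names FillDispatch, SelectFill, FillAvx and FillGeneric are hypothetical. It assumes GCC or Clang on x86 for the AVX path (using the __builtin_cpu_supports builtin and the target attribute) and falls back to a plain loop elsewhere.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define DEMO_HAS_AVX_PATH 1
#endif

typedef void (*FillFn)(uint32_t *t, uint32_t data, size_t len);

static void FillGeneric(uint32_t *t, uint32_t data, size_t len)
{
	for(size_t i = 0; i < len; i++)
		t[i] = data;
}

#ifdef DEMO_HAS_AVX_PATH
// Only this function is compiled for AVX; it must never run on a CPU without it.
__attribute__((target("avx")))
static void FillAvx(uint32_t *t, uint32_t data, size_t len)
{
	__m256i v = _mm256_set1_epi32((int)data);
	size_t i = 0;
	for(; i + 8 <= len; i += 8)
		_mm256_storeu_si256((__m256i *)(t + i), v);
	for(; i < len; i++)
		t[i] = data;
}
#endif

// Runs once: probe the CPU and pick an implementation.
static FillFn SelectFill()
{
#ifdef DEMO_HAS_AVX_PATH
	__builtin_cpu_init();
	if(__builtin_cpu_supports("avx"))
		return FillAvx;
#endif
	return FillGeneric;
}

void FillDispatch(uint32_t *t, uint32_t data, size_t len)
{
	static const FillFn fn = SelectFill(); // initialized on the first call
	assert(fn);
	fn(t, data, len);
}
```

The indirect call is the reason this only makes sense behind a non-inline function: the CPU test happens once, and the AVX opcodes stay confined to a function that is reached only when the CPU supports them.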

Best regards,

Tom
Re: BufferPainter::Clear() optimization [message #54039 is a reply to message #54036] Fri, 22 May 2020 11:59
Tom1
Messages: 794
Registered: March 2007
Contributor
mirek wrote on Fri, 22 May 2020 12:39

In the trunk now... sizes >=16 are now handled by a non-inline function. There is an impact in your benchmark (the one that runs for all sizes), less impact in my benchmark (with random sizes), but I think this is the right move...

It looks like >32 might be better in this case... Not sure though.

BR, Tom
Re: BufferPainter::Clear() optimization [message #54040 is a reply to message #54039] Fri, 22 May 2020 12:47
koldo
Messages: 3129
Registered: August 2008
Ultimate Member
Dear colleagues

Please consider Sender's proposal:
- Remove #include <emmintrin.h> from Blit.h
- Add #include <immintrin.h> to config.h

As the intrinsics are now included inside the Upp namespace, they cannot be used later by Eigen.
config.h is included in Core.h before the Upp namespace is opened.
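
The underlying issue is general C++: a guarded header that is first included inside a namespace declares its names in that namespace, and any later include at global scope is skipped by the include guard. A minimal sketch of the safe ordering follows, with Demo as a hypothetical namespace standing in for Upp:

```cpp
#include <cassert>
#include <emmintrin.h> // included at global scope, before any namespace is opened

namespace Demo { // hypothetical namespace standing in for Upp

// The intrinsics were declared globally, so they resolve here by normal lookup...
inline int FirstLane(int x)
{
	return _mm_cvtsi128_si32(_mm_set1_epi32(x));
}

}

// ...and remain usable from global scope, as a third-party header would expect.
inline int FirstLaneGlobal(int x)
{
	assert(Demo::FirstLane(x) == x);
	return _mm_cvtsi128_si32(_mm_set1_epi32(x));
}
```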

Thank you!


Best regards
Iñaki

[Updated on: Fri, 22 May 2020 12:51]


Re: BufferPainter::Clear() optimization [message #54042 is a reply to message #54039] Fri, 22 May 2020 13:01
mirek
Messages: 12536
Registered: November 2005
Ultimate Member
Tom1 wrote on Fri, 22 May 2020 11:59
mirek wrote on Fri, 22 May 2020 12:39

In the trunk now... sizes >=16 are now handled by a non-inline function. There is an impact in your benchmark (the one that runs for all sizes), less impact in my benchmark (with random sizes), but I think this is the right move...

It looks like >32 might be better in this case... Not sure though.

BR, Tom


That in turn makes the inlined part bigger.... I would rather be careful there.

OK, for what it's worth, I have tried AVX and I do not see any improvement. Here is the code (for CLANG):

__attribute__((target ("avx")))
never_inline
void memsetd_l2(dword *t, dword data, size_t len)
{
	__m128i val4 = _mm_set1_epi32(data);
	__m256i val8 = _mm256_set1_epi32(data);
	auto Set4 = [&](size_t at) { _mm_storeu_si128((__m128i *)(t + at), val4); };
	#define Set8(at) _mm256_storeu_si256((__m256i *)(t + at), val8);
	Set4(len - 4); // fill tail
	if(len >= 32) {
		if(len >= 1024*1024) { // for really huge data, bypass the cache
			huge_memsetd(t, data, len);
			return;
		}
		Set8(0); // fill the first 32 bytes, then align the pointer up to a 32-byte boundary
		const dword *e = t + len;
		t = (dword *)(((uintptr_t)t | 31) + 1);
		len = e - t;
		e -= 32;
		while(t <= e) {
			Set8(0); Set8(8); Set8(16); Set8(24);
			t += 32;
		}
	}
	if(len & 16) {
		Set8(0); Set8(8);
		t += 16;
	}
	if(len & 8) {
		Set8(0);
		t += 8;
	}
	if(len & 4)
		Set4(0);
}

inline
void FillX(void *p, dword data, size_t len)
{
	dword *t = (dword *)p;
	if(len < 4) {
		if(len & 2) {
			t[0] = t[1] = t[len - 1] = data;
			return;
		}
		if(len & 1)
			t[0] = data;
		return;
	}

	if(len >= 16) {
		memsetd_l2(t, data, len);
		return;
	}

	__m128i val4 = _mm_set1_epi32(data);
	auto Set4 = [&](size_t at) { _mm_storeu_si128((__m128i *)(t + at), val4); };
	Set4(len - 4); // fill tail
	if(len & 8) {
		Set4(0); Set4(4);
		t += 8;
	}
	if(len & 4)
		Set4(0);
}



Frankly, I am sort of happy, because the GCC/CLANG way of dealing with AVX is really stupid: it rejects AVX intrinsics unless you compile the whole function for AVX, but then it starts generating AVX opcodes everywhere and the function does not run on non-AVX CPUs anymore.
Re: BufferPainter::Clear() optimization [message #54043 is a reply to message #54042] Fri, 22 May 2020 13:06
Tom1
Messages: 794
Registered: March 2007
Contributor
Quote:
I have tried with AVX and I do not see any improvement.


So, this means SSE2 is enough to saturate the memory bus completely.

Thanks also for the new memcpy optimizations. This is equally important in many areas. :)

Best regards,

Tom
Re: BufferPainter::Clear() optimization [message #54045 is a reply to message #54043] Fri, 22 May 2020 16:58
koldo
Messages: 3129
Registered: August 2008
Ultimate Member
Problem solved. Thank you!

Best regards
Iñaki
Re: BufferPainter::Clear() optimization [message #54046 is a reply to message #54045] Fri, 22 May 2020 19:03
mirek
Messages: 12536
Registered: November 2005
Ultimate Member
Added memcpy variants optimized for element sizes of 8 and 16 bytes, and this neat little function to make sense of it all:

template <class T>
void memcpy_t(T *t, const T *s, size_t count)
{
	if((sizeof(T) & 15) == 0)
		memcpydq((dqword *)t, (const dqword *)s, count * (sizeof(T) >> 4));
	else
	if((sizeof(T) & 7) == 0)
		memcpyq((qword *)t, (const qword *)s, count * (sizeof(T) >> 3));
	else
	if((sizeof(T) & 3) == 0)
		memcpyd((dword *)t, (const dword *)s, count * (sizeof(T) >> 2));
	else
		svo_memcpy((void *)t, (void *)s, count * sizeof(T));
}


Vector<String>::ReAlloc(int newalloc)

disassembly now looks magnificent, copying elements to new buffer with SSE2...
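
Since sizeof(T) is a compile-time constant, the branch chain in memcpy_t collapses to a single call for each T. A small sketch of which copy unit a given type would select is below; CopyUnit, UnitCount and the two structs are hypothetical helpers for illustration only, and the static_asserts assume the usual 8-byte double and long long and 4-byte float.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: width in bytes of the copy unit the dispatch above would pick for T.
template <class T>
constexpr size_t CopyUnit()
{
	return (sizeof(T) & 15) == 0 ? 16
	     : (sizeof(T) & 7) == 0  ? 8
	     : (sizeof(T) & 3) == 0  ? 4
	     :                         1;
}

struct Pair64 { long long a, b; };        // 16 bytes on common platforms
struct Rgb    { unsigned char r, g, b; }; // 3 bytes

static_assert(CopyUnit<Pair64>() == 16, "copied as 128-bit units");
static_assert(CopyUnit<double>() == 8, "copied as 64-bit units");
static_assert(CopyUnit<float>() == 4, "copied as 32-bit units");
static_assert(CopyUnit<Rgb>() == 1, "falls back to the byte-oriented copy");

// Number of units copied for count elements, mirroring count * (sizeof(T) >> k) above.
template <class T>
size_t UnitCount(size_t count)
{
	assert(sizeof(T) % CopyUnit<T>() == 0);
	return count * (sizeof(T) / CopyUnit<T>());
}
```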
Re: BufferPainter::Clear() optimization [message #54049 is a reply to message #54046] Sun, 24 May 2020 10:20
mirek
Messages: 12536
Registered: November 2005
Ultimate Member
I have the first implementation and test of SSE2 AlphaBlend:

TIMING SSE            : 46.95 ms - 46.95 ns (58.00 ms / 1000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 0 - 1000000
TIMING Non SSE        : 123.95 ms - 123.95 ns (135.00 ms / 1000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 0 - 1000000


Re: BufferPainter::Clear() optimization [message #54050 is a reply to message #54049] Sun, 24 May 2020 11:56
Oblivion
Messages: 669
Registered: August 2007
Contributor
Hello Mirek,

On Linux 5.4 and 5.6, with CLANG 10.0

TIMING SSE            : 119.41 ms - 119.41 ns ( 1.06 s  / 1000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 0 - 1000000
TIMING Non SSE        : 232.41 ms - 232.41 ns ( 1.18 s  / 1000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 0 - 1000000


On GCC (10.1), apparently _mm_storeu_si32 is yet to be implemented:

'_mm_storeu_si32' was not declared in this scope; did you mean '_mm_storeu_epi32'?
 (): 47 |  _mm_storeu_si32(rgba, PackRGBA(x, _mm_setzero_si128()));
 (): |  ^~~~~~~~~~~~~~~
 (): |  _mm_storeu_epi32



A possible workaround is given here:

https://stackoverflow.com/questions/58063933/how-can-a-sse2-function-be-missing-from-the-header-it-is-supposed-to-be-in
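
Without reproducing the linked answer, one portable way to get the same effect with intrinsics that every SSE2 header provides is to extract the low lane and copy it out. This is only a sketch; StoreLow32 is a hypothetical name.

```cpp
#include <cassert>
#include <cstring>
#include <emmintrin.h>

// Stores the low 32 bits of v to p; p may be unaligned.
inline void StoreLow32(void *p, __m128i v)
{
	assert(p);
	int low = _mm_cvtsi128_si32(v);    // SSE2: extract lane 0
	std::memcpy(p, &low, sizeof(low)); // compilers typically reduce this to one 32-bit store
}
```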


Best regards,
Oblivion


Re: BufferPainter::Clear() optimization [message #54056 is a reply to message #54049] Tue, 26 May 2020 13:14
Tom1
Messages: 794
Registered: March 2007
Contributor
Hi!

Sorry for the delay... I was out of town for a while.

Here are my results for Windows 10 pro x64 on Core i7:

MSBT19x64:

TIMING SSE            : 37.08 ms - 37.08 ns (50.00 ms / 1000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 0 - 1000000
TIMING Non SSE        : 129.08 ms - 129.08 ns (142.00 ms / 1000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 0 - 1000000

MSBT19:

TIMING SSE            : 29.88 ms - 29.88 ns (45.00 ms / 1000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 0 - 1000000
TIMING Non SSE        : 125.88 ms - 125.88 ns (141.00 ms / 1000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 0 - 1000000

CLANG:

TIMING SSE            : 37.41 ms - 37.41 ns (50.00 ms / 1000000 ), min:  0.00 ns, max:  2.00 ms, nesting: 0 - 1000000
TIMING Non SSE        : 125.41 ms - 125.41 ns (138.00 ms / 1000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 0 - 1000000

CLANGx64:

TIMING SSE            : 37.43 ms - 37.43 ns (47.00 ms / 1000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 0 - 1000000
TIMING Non SSE        : 129.43 ms - 129.43 ns (139.00 ms / 1000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 0 - 1000000


Impressive numbers Mirek! When is this going to be available on BufferPainter?

Best regards,

Tom
Re: BufferPainter::Clear() optimization [message #54057 is a reply to message #54056] Tue, 26 May 2020 14:15
mirek
Messages: 12536
Registered: November 2005
Ultimate Member
Tom1 wrote on Tue, 26 May 2020 13:14
Hi!

Sorry for the delay... I was out of town for a while.

Here are my results for Windows 10 pro x64 on Core i7:

MSBT19x64:

TIMING SSE            : 37.08 ms - 37.08 ns (50.00 ms / 1000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 0 - 1000000
TIMING Non SSE        : 129.08 ms - 129.08 ns (142.00 ms / 1000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 0 - 1000000

MSBT19:

TIMING SSE            : 29.88 ms - 29.88 ns (45.00 ms / 1000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 0 - 1000000
TIMING Non SSE        : 125.88 ms - 125.88 ns (141.00 ms / 1000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 0 - 1000000

CLANG:

TIMING SSE            : 37.41 ms - 37.41 ns (50.00 ms / 1000000 ), min:  0.00 ns, max:  2.00 ms, nesting: 0 - 1000000
TIMING Non SSE        : 125.41 ms - 125.41 ns (138.00 ms / 1000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 0 - 1000000

CLANGx64:

TIMING SSE            : 37.43 ms - 37.43 ns (47.00 ms / 1000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 0 - 1000000
TIMING Non SSE        : 129.43 ms - 129.43 ns (139.00 ms / 1000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 0 - 1000000


Impressive numbers Mirek! When is this going to be available on BufferPainter?

Best regards,

Tom


I guess by the end of the week. Still fixing bugs + there are like 8 variants to implement...
Re: BufferPainter::Clear() optimization [message #54099 is a reply to message #54057] Mon, 01 June 2020 00:39
mirek
Messages: 12536
Registered: November 2005
Ultimate Member
While optimizing memcpy and memset, I have taken a new look at other things, like String comparison and memhash. I think I improved String::operator== a tiny bit and now I am working on the memhash function. I decided to introduce "hash_t" and to make the hash value 64-bit when CPU_64 is defined.

After a bit of experimenting, I have found that these functions (one for 64-bit, the other for 32-bit) work best:

never_inline
uint64 memhash64(const void *ptr, int len)
{
	const byte *s = (byte *)ptr;
	uint64 val = HASH64_CONST1;
	if(len >= 8) {
		if(len >= 32) {
			uint64 val1, val2, val3, val4;
			val1 = val2 = val3 = val4 = HASH64_CONST1;
			while(len >= 32) {
				val1 = HASH64_CONST2 * val1 + *(qword *)(s);
				val2 = HASH64_CONST2 * val2 + *(qword *)(s + 8);
				val3 = HASH64_CONST2 * val3 + *(qword *)(s + 16);
				val4 = HASH64_CONST2 * val4 + *(qword *)(s + 24);
				s += 32;
				len -= 32;
			}
			val = HASH64_CONST2 * val + val1;
			val = HASH64_CONST2 * val + val2;
			val = HASH64_CONST2 * val + val3;
			val = HASH64_CONST2 * val + val4;
		}
		const byte *e = s + len - 8;
		while(s < e) {
			val = HASH64_CONST2 * val + *(qword *)(s);
			s += 8;
		}
		return HASH64_CONST2 * val + *(qword *)(e);
	}
	if(len > 4) {
		val = HASH64_CONST2 * val + *(dword *)(s);
		val = HASH64_CONST2 * val + *(dword *)(s + len - 4);
		return val;
	}
	if(len >= 2) {
		val = HASH64_CONST2 * val + *(word *)(s);
		val = HASH64_CONST2 * val + *(word *)(s + len - 2);
		return val;
	}
	return len ? HASH64_CONST2 * val + *s : val;
}

never_inline
uint64 memhash32(const void *ptr, int len)
{
	const byte *s = (byte *)ptr;
	uint64 val = HASH32_CONST1;
	if(len >= 4) {
		if(len >= 16) {
			uint64 val1, val2, val3, val4;
			val1 = val2 = val3 = val4 = HASH32_CONST1;
			while(len >= 16) {
				val1 = HASH32_CONST2 * val1 + *(dword *)(s);
				val2 = HASH32_CONST2 * val2 + *(dword *)(s + 4);
				val3 = HASH32_CONST2 * val3 + *(dword *)(s + 8);
				val4 = HASH32_CONST2 * val4 + *(dword *)(s + 12);
				s += 16;
				len -= 16;
			}
			val = HASH32_CONST2 * val + val1;
			val = HASH32_CONST2 * val + val2;
			val = HASH32_CONST2 * val + val3;
			val = HASH32_CONST2 * val + val4;
		}
		const byte *e = s + len - 4;
		while(s < e) {
			val = HASH32_CONST2 * val + *(dword *)(s);
			s += 4;
		}
		return HASH32_CONST2 * val + *(dword *)(e);
	}
	if(len >= 2) {
		val = HASH32_CONST2 * val + *(word *)(s);
		val = HASH32_CONST2 * val + *(word *)(s + len - 2);
		return val;
	}
	return len ? HASH32_CONST2 * val + *s : val;
}


While other "mem*" functions are easy to write tests for, hashing is a bit more complicated; can I request some code review here? Basically, I think the combination functions are OK, but I would like to be sure that it reads exactly len bytes from memory (it is OK if some are read twice...).
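
One way to check the "reads exactly len bytes" property mechanically is to replay only the memory accesses of the routine against a per-byte counter. The sketch below mirrors the access pattern of the 64-bit function as posted above (no hashing is performed); ReadsExactly64 is a hypothetical test helper, not part of U++.

```cpp
#include <cassert>
#include <vector>

// Replays the reads of the posted 64-bit routine and reports whether every read
// stays inside [0, len) and every byte is read at least once.
bool ReadsExactly64(int len)
{
	assert(len >= 0);
	std::vector<int> hits(len, 0);
	bool in_bounds = true;
	auto read = [&](int at, int size) {
		if(at < 0 || at + size > len) {
			in_bounds = false;
			return;
		}
		for(int i = 0; i < size; i++)
			hits[at + i]++;
	};
	int s = 0, n = len;
	if(n >= 8) {
		while(n >= 32) { // unrolled block of four 8-byte reads
			read(s, 8); read(s + 8, 8); read(s + 16, 8); read(s + 24, 8);
			s += 32;
			n -= 32;
		}
		int e = s + n - 8;
		while(s < e) {
			read(s, 8);
			s += 8;
		}
		read(e, 8); // final read ends exactly at the end of the buffer
	}
	else if(n > 4) { read(0, 4); read(n - 4, 4); }
	else if(n >= 2) { read(0, 2); read(n - 2, 2); }
	else if(n) read(0, 1);
	for(int h : hits)
		if(h == 0)
			return false;
	return in_bounds;
}
```

Running this for every length from 0 up to a few hundred exercises all branch combinations. The same replay can be adapted to the 32-bit variant by changing the block and word sizes.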

Mirek