U++ framework: U++ Developers corner » SSE2 and SVO optimization (Painter, memcpy....)

Home » Developing U++ » U++ Developers corner » SSE2 and SVO optimization (Painter, memcpy....)

Show: Today's Messages :: Polls :: Message Navigator
E-mail to friend

Re: BufferPainter::Clear() optimization [message #53972 is a reply to message #53970]

Mon, 18 May 2020 20:57

Tom1
Messages: 1319
Registered: March 2007

Ultimate Contributor

Hi,

My CPU here is Intel(R) Core(TM) i7-4790K:

https://ark.intel.com/content/www/us/en/ark/products/80807/i ntel-core-i7-4790k-processor-8m-cache-up-to-4-40-ghz.html

Not surprisingly, they say it has an 8M 'smart cache'.

Please find attached two CSV files portraying execution time in ns for each call in average. The length is in dwords. Fill3a is there for reference and Fill3T is using 64 dword threshold for streaming in one and 2M dword (8MB) threshold in the other file. While not portrayed here, increasing the threshold above 32MB decreases the performance from 1.5 ms to 3.6 ms for a 32 MB buffer.

Best regards,

Tom

Attachment: Fill3T-cache-comparison.7z
(Size: 1.61KB, Downloaded 283 times)

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53973 is a reply to message #53972]

Mon, 18 May 2020 21:20

Tom1
Messages: 1319
Registered: March 2007

Ultimate Contributor

Hi,

It looks that luckily CPUID reveals cache information:

https://www.intel.com/content/www/us/en/architecture-and-tec hnology/64-ia-32-architectures-software-developer-vol-2a-man ual.html

Initial value in EAX 80000006H
ECX:
Bits 07 - 00: Cache Line size in bytes.
Bits 11 - 08: Reserved.
Bits 15 - 12: L2 Associativity field.
Bits 31 - 16: Cache size in 1K units.

Could we use this?

Best regards,

Tom

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53974 is a reply to message #53973]

Mon, 18 May 2020 21:40

Tom1
Messages: 1319
Registered: March 2007

Ultimate Contributor

Hi Mirek,

Something like this, maybe... I'm not quite sure as this method reports 16M cache for me -- although this works quite well for me:

static int cachesize=999;
INITBLOCK{
#ifdef COMPILER_MSC
	int cpuInfo[4];
	Zero(cpuInfo);
	__cpuid(cpuInfo, 0x80000006);
#else
	unsigned int cpuInfo[4];
	Zero(cpuInfo);
	__get_cpuid(0x80000006, &cpuInfo[0], &cpuInfo[1], &cpuInfo[2], &cpuInfo[3]);
#endif
	cachesize=1024*(cpuInfo[2]>>16)*(cpuInfo[2]&0xff);
};


void inline Fill3T(void *b, dword data, int len){
	switch(len){
		case 3: ((dword *)b)[2] = data;
		case 2: ((dword *)b)[1] = data;
		case 1: ((dword *)b)[0] = data;
		case 0: return;
	}
	__m128i q = _mm_set1_epi32(*(int*)&data);
	__m128i *w = (__m128i*)b;
	
	if(len >= 32) {
		__m128i *e = (__m128i*)b + (len>>2) - 8;
		if(len >= (cachesize>>2) && ((uintptr_t)w & 3) == 0) { // for really huge data, bypass the cache
			_mm_storeu_si128(w, q); // Head align
			int s=(-((int)((uintptr_t)b)>>2))&0x3;
			w = (__m128i*) ((dword*)b) + s;
			do {
				_mm_stream_si128(w++, q);
				_mm_stream_si128(w++, q);
				_mm_stream_si128(w++, q);
				_mm_stream_si128(w++, q);
				_mm_stream_si128(w++, q);
				_mm_stream_si128(w++, q);
				_mm_stream_si128(w++, q);
				_mm_stream_si128(w++, q);
			}while(w<=e);
			_mm_sfence();
		}
		else
			do {
				_mm_storeu_si128(w++, q);
				_mm_storeu_si128(w++, q);
				_mm_storeu_si128(w++, q);
				_mm_storeu_si128(w++, q);
				_mm_storeu_si128(w++, q);
				_mm_storeu_si128(w++, q);
				_mm_storeu_si128(w++, q);
				_mm_storeu_si128(w++, q);
			}while(w<=e);
	}
	
	if(len & 16) {
		_mm_storeu_si128(w++, q);
		_mm_storeu_si128(w++, q);
		_mm_storeu_si128(w++, q);
		_mm_storeu_si128(w++, q);
	}
	if(len & 8) {
		_mm_storeu_si128(w++, q);
		_mm_storeu_si128(w++, q);
	}
	if(len & 4) {
		_mm_storeu_si128(w, q);
	}
	_mm_storeu_si128((__m128i*) (((dword*)b) + len - 4), q); // Tail align
}

Best regards,

Tom

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53975 is a reply to message #53972]

Mon, 18 May 2020 21:56

mirek
Messages: 14291
Registered: November 2005

Ultimate Member

Tom1 wrote on Mon, 18 May 2020 20:57

Hi,

My CPU here is Intel(R) Core(TM) i7-4790K:

https://ark.intel.com/content/www/us/en/ark/products/80807/i ntel-core-i7-4790k-processor-8m-cache-up-to-4-40-ghz.html

Not surprisingly, they say it has an 8M 'smart cache'.

Please find attached two CSV files portraying execution time in ns for each call in average. The length is in dwords. Fill3a is there for reference and Fill3T is using 64 dword threshold for streaming in one and 2M dword (8MB) threshold in the other file. While not portrayed here, increasing the threshold above 32MB decreases the performance from 1.5 ms to 3.6 ms for a 32 MB buffer.

Best regards,

Tom

If I interpret these numbers correctly, it looks like around 4MB potential drop because of cache bypass starts to be diminish, right?

Thing is, I am afraid that making this dynamic will cause a lot of problems, starting with perfromance - it is after all another read from the memory. I would settle for some compromise constant there. Like 4MB... Smile

Smile

Mirek

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53977 is a reply to message #53751]

Tue, 19 May 2020 00:02

mirek
Messages: 14291
Registered: November 2005

Ultimate Member

What about this:

never_inline
void HugeFill(dword *t, dword c, int len)
{
	__m128i val4 = _mm_set1_epi32(*(int*)&c);
	auto Set4S = [&](int at) { _mm_stream_si128((__m128i *)(t + at), val4); };
	while((uintptr_t)t & 15) { // align to 16 bytes for SSE
		*t++ = c;
		len--;
	}
	while(len >= 16) {
		Set4S(0);
		Set4S(4);
		Set4S(8);
		Set4S(12);
		t += 16;
		len -= 16;
	}
	while(len--)
		*t++ = c;
	_mm_sfence();
}

void Fill6(dword *t, dword c, int len)
{
	if(len >= 4) {
		__m128i val4 = _mm_set1_epi32(*(int*)&c);
		auto Set4 = [&](int at) { _mm_storeu_si128((__m128i *)(t + at), val4); };
		if(len > 4*1024*1024 / 4) {
			HugeFill(t, c, len);
			return;
		}
		while(len >= 16) {
			Set4(0);
			Set4(4);
			Set4(8);
			Set4(12);
			t += 16;
			len -= 16;
		}
		if(len & 8) {
			Set4(0);
			Set4(4);
			t += 8;
		}
		if(len & 4) {
			Set4(0);
			t += 4;
		}
	}
	if(len & 3)
		t[0] = t[(len & 2) >> 1] = t[(len & 2) & ((len & 1) << 1)] = c;
}

[Updated on: Tue, 19 May 2020 09:01]

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53979 is a reply to message #53977]

Tue, 19 May 2020 08:59

Tom1
Messages: 1319
Registered: March 2007

Ultimate Contributor

Hi,

Fill6 fails integrity check due to a small indexing glitch here:

	if(len & 8) {
		Set4(0);
		Set4(8); // << Should be 4
		t += 8;
	}

However, Fill3T is still faster below 64 and mostly on par above that on my i7.

And thanks! I do indeed enjoy the final alignment trick! Smile

Smile

Very clever!

Best regards,

Tom

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53980 is a reply to message #53979]

Tue, 19 May 2020 09:14

mirek
Messages: 14291
Registered: November 2005

Ultimate Member

Yeah, there was another bug in it, I should test more before posting.

In retrospective, while the trick is nice, I do not think it is worth it. But if you wanted to experiment with this path, I have found the way how to extend / simplify this. The basic idea is

int nlen = -len;
t[1 & HIBYTE(nlen)] = c;
nlen++;
t[2 & HIBYTE(nlen)] = c;
nlen++;
t[3 & HIBYTE(nlen)] = c;
....

(at some point, nlen will become > 0 and thus HIBYTE goes from 0xff to 0x00, thus "grounding" indices).

Also, I would like to try to explain why I am trying to beat Fill3T. It is about those switches, while

switch(len) {
case 0:
case 1:
case 2:
default:
}

looks magnificent, it is actually 2 "unstable" branch predictions and quite a bit of code to compute the target address. So

if(len & 2) {
}
if(len & 1) {
}

should be on par - 2 branch predictions and maybe a bit less of code....

Mirek

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53981 is a reply to message #53751]

Tue, 19 May 2020 09:49

mirek
Messages: 14291
Registered: November 2005

Ultimate Member

Also, a little note about your testing code: You loop over the same "len" many times and measure that. The problem is that first pass setups branch prediction so all other passes are predicted. If "len" is changing, prediction fails and you might get different results....

Which explains why my tests, which feeds random lens, shows a bit different picture... Smile

Smile

All in all, I think in the end we will just need to test this with Painter....

Mirek

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53982 is a reply to message #53751]

Tue, 19 May 2020 11:32

mirek
Messages: 14291
Registered: November 2005

Ultimate Member

Three more variants, based on your FillT. Fill7 is basically identical, with a little trick added (hope you like it). Fill7a has different "frontend". Fill8 is not performing very well, adding that just so that you know I have tested that variant too... Smile

Smile

Fill7 and Fill7a seem to be basically equal and maybe just a tiny bit faster than Fill3T....

void Fill7(dword *t, dword data, int len){
	switch(len) {
		case 3: t[2] = data;
		case 2: t[1] = data;
		case 1: t[0] = data;
		case 0: return;
	}

	__m128i val4 = _mm_set1_epi32(data);
	auto Set4 = [&](int at) { _mm_storeu_si128((__m128i *)(t + at), val4); };

	Set4(len - 4); // fill tail
	if(len >= 32) {
		if(len >= 1024*1024) { // for really huge data, bypass the cache
			HugeFill(t, data, len);
			return;
		}
		const dword *e = t + len - 32;
		do {
			Set4(0); Set4(4); Set4(8); Set4(12);
			Set4(16); Set4(20); Set4(24); Set4(28);
			t += 32;
		}
		while(t <= e);
	}
	if(len & 16) {
		Set4(0); Set4(4); Set4(8); Set4(12);
		t += 16;
	}
	if(len & 8) {
		Set4(0); Set4(4);
		t += 8;
	}
	if(len & 4)
		Set4(0);
}

void Fill7a(dword *t, dword data, int len){
	if(len < 4) {
		if(len & 2) {
			t[0] = t[1] = data;
			t += 2;
		}
		if(len & 1)
			t[0] = data;
		return;
	}

	__m128i val4 = _mm_set1_epi32(data);
	auto Set4 = [&](int at) { _mm_storeu_si128((__m128i *)(t + at), val4); };

	Set4(len - 4); // fill tail
	if(len >= 32) {
		if(len >= 1024*1024) { // for really huge data, bypass the cache
			HugeFill(t, data, len);
			return;
		}
		const dword *e = t + len - 32;
		do {
			Set4(0); Set4(4); Set4(8); Set4(12);
			Set4(16); Set4(20); Set4(24); Set4(28);
			t += 32;
		}
		while(t <= e);
	}
	if(len & 16) {
		Set4(0); Set4(4); Set4(8); Set4(12);
		t += 16;
	}
	if(len & 8) {
		Set4(0); Set4(4);
		t += 8;
	}
	if(len & 4)
		Set4(0);
}

void Fill8(dword *t, dword data, int len){
	switch(len) {
		case 3: t[2] = data;
		case 2: t[1] = data;
		case 1: t[0] = data;
		case 0: return;
	}

	__m128i val4 = _mm_set1_epi32(data);
	auto Set4 = [&](int at) { _mm_storeu_si128((__m128i *)(t + at), val4); };

	Set4(len - 4); // fill tail
	if(len >= 32) {
		if(len >= 1024*1024) { // for really huge data, bypass the cache
			HugeFill(t, data, len);
			return;
		}
		int cnt = len >> 5;
		do {
			Set4(0); Set4(4); Set4(8); Set4(12);
			len -= 32;
			Set4(16); Set4(20); Set4(24); Set4(28);
			t += 32;
		}
		while(len >= 32);
	}
	switch((len >> 2) & 7) {
	case 7: Set4(24);
	case 6: Set4(20);
	case 5: Set4(16);
	case 4: Set4(12);
	case 3: Set4(8);
	case 2: Set4(4);
	case 1: Set4(0);
	}
}

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53983 is a reply to message #53981]

Tue, 19 May 2020 12:35

Tom1
Messages: 1319
Registered: March 2007

Ultimate Contributor

mirek wrote on Tue, 19 May 2020 10:49

Also, a little note about your testing code: You loop over the same "len" many times and measure that. The problem is that first pass setups branch prediction so all other passes are predicted. If "len" is changing, prediction fails and you might get different results....

Which explains why my tests, which feeds random lens, shows a bit different picture...

All in all, I think in the end we will just need to test this with Painter....

Mirek

I wish I came to think of this benchmarking pitfall... I mean the branch prediction. Well, I agree that we need to put it in the BufferPainter environment for real test.

Meanwhile, as you worked on 7, 7a and 8, I prepared 3T2, which avoids the switch and uses ifs instead. Funnily, your 7a does the same, but with table offsets. Smile

Smile

void inline Fill3T2(dword *b, dword data, int len){
	if(len<4){
		if(len&1) *b++ = data;
		if(len&2){ *b++ = data; *b++ = data; }
		return;
	}

	__m128i q = _mm_set1_epi32(*(int*)&data);
	__m128i *w = (__m128i*)b;
	
	if(len >= 32) {
		__m128i *e = (__m128i*)b + (len>>2) - 8;
		if(len > 4*1024*1024 / 4 && ((uintptr_t)w & 3) == 0) { // for really huge data, bypass the cache
			_mm_storeu_si128(w, q); // Head align
			int s=(-((int)((uintptr_t)b)>>2))&0x3;
			w = (__m128i*) (b + s);
			do {
				_mm_stream_si128(w++, q);
				_mm_stream_si128(w++, q);
				_mm_stream_si128(w++, q);
				_mm_stream_si128(w++, q);
				_mm_stream_si128(w++, q);
				_mm_stream_si128(w++, q);
				_mm_stream_si128(w++, q);
				_mm_stream_si128(w++, q);
			}while(w<=e);
			_mm_sfence();
		}
		else
			do {
				_mm_storeu_si128(w++, q);
				_mm_storeu_si128(w++, q);
				_mm_storeu_si128(w++, q);
				_mm_storeu_si128(w++, q);
				_mm_storeu_si128(w++, q);
				_mm_storeu_si128(w++, q);
				_mm_storeu_si128(w++, q);
				_mm_storeu_si128(w++, q);
			}while(w<=e);
	}
	
	if(len & 16) {
		_mm_storeu_si128(w++, q);
		_mm_storeu_si128(w++, q);
		_mm_storeu_si128(w++, q);
		_mm_storeu_si128(w++, q);
	}
	if(len & 8) {
		_mm_storeu_si128(w++, q);
		_mm_storeu_si128(w++, q);
	}
	if(len & 4) {
		_mm_storeu_si128(w, q);
	}
	_mm_storeu_si128((__m128i*) (b + len - 4), q); // Tail align
}

I really like the w++ incremental pointer logic over the Set4(pointer+offset). This approach seems to give a small improvement on my system.

Next, I will test your 7 + 7a and report against 3T2.

But seriously, we need to put an end to this madness! The bang for the buck is rapidly decreasing as working hours are increasing... Smile

Smile

Best regards,

Tom

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53984 is a reply to message #53983]

Tue, 19 May 2020 12:45

mirek
Messages: 14291
Registered: November 2005

Ultimate Member

[quote title=Tom1 wrote on Tue, 19 May 2020 12:35]mirek wrote on Tue, 19 May 2020 10:49

I really like the w++ incremental pointer logic over the Set4(pointer+offset). This approach seems to give a small improvement on my system.

Compiler actually converts that to offsets anyway... (I have checked disassembly).

Quote:

But seriously, we need to put an end to this madness! The bang for the buck is rapidly decreasing as working hours are increasing...

Smile

Well, you have started it Smile

Smile

Mirek

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53985 is a reply to message #53984]

Tue, 19 May 2020 13:18

Tom1
Messages: 1319
Registered: March 2007

Ultimate Contributor

[quote title=mirek wrote on Tue, 19 May 2020 13:45]Tom1 wrote on Tue, 19 May 2020 12:35

mirek wrote on Tue, 19 May 2020 10:49

I really like the w++ incremental pointer logic over the Set4(pointer+offset). This approach seems to give a small improvement on my system.

Compiler actually converts that to offsets anyway... (I have checked disassembly).

Quote:

But seriously, we need to put an end to this madness! The bang for the buck is rapidly decreasing as working hours are increasing...

Well, you have started it

Mirek

Laughing

I admit to it! My fault... Smile

Smile

Anyway, pick you choice: 7a or 3T2, but note that MSBT19 (32bit I mean) likes 3T2 better on short transfers. CLANG, CLANGx64 and MSBT19x64 are happy with both. (But, please do your own benchmarks, as this is just my repeated scan through different lengths with the pitfall you pointed out earlier.)

Best regards,

Tom

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53986 is a reply to message #53985]

Tue, 19 May 2020 16:22

Tom1
Messages: 1319
Registered: March 2007

Ultimate Contributor

Hi,

WARNING: Something still wrong in 3T2 alignment code. I will continue to investigate it.

Best regards,

Tom

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53989 is a reply to message #53986]

Wed, 20 May 2020 01:34

Tom1
Messages: 1319
Registered: March 2007

Ultimate Contributor

Hi Mirek,

Yes, I'm nuts... still working at this hour.

Anyway, here's a new version - Fill3T3 - that can actually handle all alignment variations (even those not handled by 7a). Please benchmark and check for correctness:

never_inline void FillStream(dword *b, dword data, int len){
	
	while((uintptr_t)b & 15){ // Try to align
		*b++=data;
		len--;
	};
	__m128i *w = (__m128i *)b;
	__m128i q = _mm_set1_epi32((int)data);
	if(len>=16){
		__m128i *e = w + (len>>2) - 3;
		do{
			_mm_stream_si128(w++, q);
			_mm_stream_si128(w++, q);
			_mm_stream_si128(w++, q);
			_mm_stream_si128(w++, q);
		}while(w<e);
	}
	if(len & 8) {
		_mm_stream_si128(w++, q);
		_mm_stream_si128(w++, q);
	}
	if(len & 4) {
		_mm_stream_si128(w++, q);
	}
	_mm_sfence();
	_mm_storeu_si128((__m128i*)(b + len - 4), q); // Tail align
}

void inline Fill3T3(dword *b, dword data, int len){
	if(len<4){
		if(len&1) *b++ = data;
		if(len&2){ *b++ = data; *b++ = data; }
		return;
	}

	__m128i *w = (__m128i *)b;
	__m128i q = _mm_set1_epi32((int)data);

	if(len >= 32) {
		if(len>1024*1024 && (((uintptr_t)b & 3)==0)){
			FillStream(b,data,len);
			return;
		}
		
		__m128i *e = w + (len>>2) - 7;
		do{
			_mm_storeu_si128(w++, q);
			_mm_storeu_si128(w++, q);
			_mm_storeu_si128(w++, q);
			_mm_storeu_si128(w++, q);
			_mm_storeu_si128(w++, q);
			_mm_storeu_si128(w++, q);
			_mm_storeu_si128(w++, q);
			_mm_storeu_si128(w++, q);
		}while(w<e);
	}
	if(len & 16) {
		_mm_storeu_si128(w++, q);
		_mm_storeu_si128(w++, q);
		_mm_storeu_si128(w++, q);
		_mm_storeu_si128(w++, q);
	}
	if(len & 8) {
		_mm_storeu_si128(w++, q);
		_mm_storeu_si128(w++, q);
	}
	if(len & 4) {
		_mm_storeu_si128(w++, q);
	}
	_mm_storeu_si128((__m128i*)(b + len - 4), q); // Tail align

}

Best regards,

Tom

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53990 is a reply to message #53989]

Wed, 20 May 2020 01:52

mirek
Messages: 14291
Registered: November 2005

Ultimate Member

[quote title=Tom1 wrote on Wed, 20 May 2020 01:34]Hi Mirek,

Yes, I'm nuts... still working at this hour.

Anyway, here's a new version - Fill3T3 - that can actually handle all alignment variations (even those not handled by 7a). Please benchmark and check for correctness:

	if(len & 8) {
		_mm_stream_si128(w++, q);
		_mm_stream_si128(w++, q);
	}
	if(len & 4) {
		_mm_stream_si128(w++, q);
	}

Yeah, I think that after filling 8MB of data, this will really have impact compared to trivial loop Smile

Smile

Mirek

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53991 is a reply to message #53990]

Wed, 20 May 2020 08:22

Tom1
Messages: 1319
Registered: March 2007

Ultimate Contributor

mirek wrote on Wed, 20 May 2020 02:52

Yeah, I think that after filling 8MB of data, this will really have impact compared to trivial loop

Mirek

Laughing

Best regards,

Tom

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53993 is a reply to message #53991]

Wed, 20 May 2020 10:04

Tom1
Messages: 1319
Registered: March 2007

Ultimate Contributor

Hi,

There must still be something wrong with 3T3 because applying it to BufferPainter (replacing FillRGBA) causes artifacts in drawing. E.g. PainterExamples spiral example at 3x scale clearly shows noise in line edges. Sad

Sad

Best regards,

Tom

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53994 is a reply to message #53993]

Wed, 20 May 2020 10:20

mirek
Messages: 14291
Registered: November 2005

Ultimate Member

Tom1 wrote on Wed, 20 May 2020 10:04

Hi,

There must still be something wrong with 3T3 because applying it to BufferPainter (replacing FillRGBA) causes artifacts in drawing. E.g. PainterExamples spiral example at 3x scale clearly shows noise in line edges.

Best regards,

Tom

My guts feeling is either the tail fill, or less likely, "e" computation. I think these are simpler code in Fill7a.... (and actually, these are the only real difference now).

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53995 is a reply to message #53994]

Wed, 20 May 2020 10:55

Tom1
Messages: 1319
Registered: March 2007

Ultimate Contributor

Hi,

No, Sorry... I'll take that alarm back. There is no error in 3T3 after all. My copy of the code inside Painter was faulty... Now I took the correct version and it is all good now.

I'm just too tired after not sleeping too much lately...

Best regards,

Tom

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53996 is a reply to message #53995]

Wed, 20 May 2020 11:56

mirek
Messages: 14291
Registered: November 2005

Ultimate Member

So I have replaced memsetd with Fill7a, replaced RGBA fill with (new) memsetd and did benchmarks.

Except the situation where the benchmark involves Clear of large area, numbers have not changed... Smile

Smile

EDIT: Bug on my part, retesting...

Mirek

[Updated on: Wed, 20 May 2020 12:07]

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53997 is a reply to message #53996]

Wed, 20 May 2020 12:23

mirek
Messages: 14291
Registered: November 2005

Ultimate Member

OK, after retesting, I think it might be at most 3% faster. Looking at fillers, I think there is much more time spent in AlphaBlend function - even if it is just for segment start/end pixels. Perhaps that one should be SSE2 optimized? Smile

Smile

Mirek

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53998 is a reply to message #53997]

Wed, 20 May 2020 12:41

Tom1
Messages: 1319
Registered: March 2007

Ultimate Contributor

Hi Mirek,

Two things to consider before you go with 7a:

- 7a crashes on unaligned buffers (t&3) while 3T3 handles them all.
- 3T3 is faster on MSBT19 with short transfers up to 50-60 dwords.

Best regards,

Tom

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53999 is a reply to message #53997]

Wed, 20 May 2020 12:52

Tom1
Messages: 1319
Registered: March 2007

Ultimate Contributor

mirek wrote on Wed, 20 May 2020 13:23

OK, after retesting, I think it might be at most 3% faster. Looking at fillers, I think there is much more time spent in AlphaBlend function - even if it is just for segment start/end pixels. Perhaps that one should be SSE2 optimized?

Mirek

Hi,

My SSE2 battery is now 'discharged' for a while.... Need to recharge before next use. Smile

Smile

I also did some testing on span filler with memcpy. This is based on using IMAGE_OPAQUE of the image being rendered. It does improve the speed somewhat, but the edges cause a problem since the edge is alpha blended even if FILL_FAST is specified. So, this needs some reconsideration and better knowledge on the Painter internals (i.e. beyond my level...):

BufferPainter.h:

struct SpanSource {
	int kind;
	SpanSource(){
		kind = IMAGE_OPAQUE;
	}
	virtual void Get(RGBA *span, int x, int y, unsigned len) = 0;
	virtual ~SpanSource() {}
};

Fillers.cpp:

void SpanFiller::Render(int val, int len)
{
	if(val == 0) {
		t += len;
		s += len;
		return;
	}
	if(alpha != 256)
		val = alpha * val >> 8;

	if(val == 256) {
		if(ss->kind==IMAGE_OPAQUE) memcpy(t,s,len*sizeof(RGBA)); // apex_memcpy() would be even faster
		else{
			for(int i = 0; i < len; i++) {
				if(s[i].a == 255)
					t[i] = s[i];
				else
					AlphaBlend(t[i], s[i]);
			}
		}
		t += len;
		s += len;
	}
	else {
		const RGBA *e = t + len;
		while(t < e)
			AlphaBlendCover8(*t++, *s++, val);
	}
}

Painter/Image.cpp:


struct PainterImageSpan : SpanSource, PainterImageSpanData {
	LinearInterpolator interpolator;

	PainterImageSpan(const PainterImageSpanData& f)
	:	PainterImageSpanData(f) {
		interpolator.Set(xform);
		kind = image.GetKindNoScan(); // Add this
	}

Best regards,

Tom

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #54000 is a reply to message #53998]

Wed, 20 May 2020 12:53

mirek
Messages: 14291
Registered: November 2005

Ultimate Member

I was aware about unaligned problem, thats fixed in final version. That said, unaligned in general should be considered illegal anyway, because otherwise hell will broke lose with Armv6....

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #54002 is a reply to message #54000]

Wed, 20 May 2020 13:01

Tom1
Messages: 1319
Registered: March 2007

Ultimate Contributor

Quote:

I was aware about unaligned problem, thats fixed in final version. That said, unaligned in general should be considered illegal anyway, because otherwise hell will broke lose with Armv6....

But that's good to know. In this case we could drop (t&3) code entirely from 3T3 and improve instruction cache locality for even better results on short transfers.

((Is there a way to 'cleanly crash' (whatever that might mean) an application attempting unaligned memset? Now it just disappears from the process list at least on Windows.))

EDIT: Let me rephrase it: Is there a way to check during development that an application will never use unaligned memset?

Best regards,

Tom

[Updated on: Wed, 20 May 2020 13:09]

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #54003 is a reply to message #54002]

Wed, 20 May 2020 15:18

mirek
Messages: 14291
Registered: November 2005

Ultimate Member

Tom1 wrote on Wed, 20 May 2020 13:01

EDIT: Let me rephrase it: Is there a way to check during development that an application will never use unaligned memset?

memsetd!

Yes, put ASSERT(((uintptr_t)t & 3) == 0); to memsetd Smile

Smile

Mirek

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #54004 is a reply to message #53998]

Wed, 20 May 2020 15:58

mirek
Messages: 14291
Registered: November 2005

Ultimate Member

Tom1 wrote on Wed, 20 May 2020 12:41

- 3T3 is faster on MSBT19 with short transfers up to 50-60 dwords.

Interestingly, adding "inline" to it seems to fix the problem... Smile

Smile

For some reason, 32-bit MSC does not inline it unless you ask it to do so...

In fact, assembler for both inlined function is virtually the same, the only difference is different tail handling (IMO, mine is 1% esier on eye):

7a:

0017940D  cmp ecx,byte +0x4 
00179410  jnl 0x17942f 
00179412  test cl,0x2 
00179415  jz 0x17941f 
00179417  mov [eax+0x4],edi 
0017941A  mov [eax],edi 
0017941C  add eax,byte +0x8 
0017941F  test cl,0x1 
00179422  jz dword 0x1794b4 
00179428  mov [eax],edi 
0017942A  jmp dword 0x1794b4 
0017942F  movd xmm0,edi 
00179433  pshufd xmm0,xmm0,0x0 
00179438  movups [eax+ecx*4-0x10],xmm0     <<= tail handling
0017943D  cmp ecx,byte +0x20 
00179440  jl 0x179486 
00179442  cmp ecx,0x100000 
00179448  jl 0x179457 
0017944A  push ecx 
0017944B  push edi 
0017944C  push eax 
0017944D  call dword 0x14ff88 
00179452  add esp,byte +0xc 
00179455  jmp short 0x1794b4 
00179457  lea edx,[ecx-0x20] 
0017945A  lea edx,[eax+edx*4] 
0017945D  nop dword [eax] 
00179460  movups [eax],xmm0 
00179463  movups [eax+0x10],xmm0 
00179467  movups [eax+0x20],xmm0 
0017946B  movups [eax+0x30],xmm0 
0017946F  movups [eax+0x40],xmm0 
00179473  movups [eax+0x50],xmm0 
00179477  movups [eax+0x60],xmm0 
0017947B  movups [eax+0x70],xmm0 
0017947F  sub eax,byte -0x80 
00179482  cmp eax,edx 
00179484  jna 0x179460 
00179486  test cl,0x10 
00179489  jz 0x17949d 
0017948B  movups [eax],xmm0 
0017948E  movups [eax+0x10],xmm0 
00179492  movups [eax+0x20],xmm0 
00179496  movups [eax+0x30],xmm0 
0017949A  add eax,byte +0x40 
0017949D  test cl,0x8 
001794A0  jz 0x1794ac 
001794A2  movups [eax],xmm0 
001794A5  movups [eax+0x10],xmm0 
001794A9  add eax,byte +0x20 
001794AC  test cl,0x4 
001794AF  jz 0x1794b4 
001794B1  movups [eax],xmm0

3T3

00179540  cmp eax,byte +0x4 
00179543  jnl 0x179560 
00179545  test al,0x1 
00179547  jz 0x17954e 
00179549  mov [edx],edi 
0017954B  add edx,byte +0x4 
0017954E  test al,0x2 
00179550  jz dword 0x179607 
00179556  mov [edx],edi 
00179558  mov [edx+0x4],edi 
0017955B  jmp dword 0x179607 
00179560  movd xmm0,edi 
00179564  mov ecx,edx 
00179566  pshufd xmm0,xmm0,0x0 
0017956B  cmp eax,byte +0x20 
0017956E  jl 0x1795c6 
00179570  cmp eax,0x100000 
00179575  jng 0x179589 
00179577  test dl,0x3 
0017957A  jnz 0x179589 
0017957C  push eax 
0017957D  push edi 
0017957E  push edx 
0017957F  call dword 0x14ff88 
00179584  add esp,byte +0xc 
00179587  jmp short 0x179604 
00179589  mov edi,eax 
0017958B  sar edi,0x2 
0017958E  sub edi,byte +0x7 
00179591  shl edi,0x4 
00179594  add edi,edx 
00179596  mov eax,ecx 
00179598  movups [eax],xmm0 
0017959B  lea eax,[ecx+0x70] 
0017959E  movups [ecx+0x10],xmm0 
001795A2  movups [ecx+0x20],xmm0 
001795A6  movups [ecx+0x30],xmm0 
001795AA  movups [ecx+0x40],xmm0 
001795AE  movups [ecx+0x50],xmm0 
001795B2  movups [ecx+0x60],xmm0 
001795B6  sub ecx,byte -0x80 
001795B9  movups [eax],xmm0 
001795BC  cmp ecx,edi 
001795BE  jc 0x179596 
001795C0  mov eax,[ebp-0x14] 
001795C3  mov edi,[ebp-0x18] 
001795C6  test al,0x10 
001795C8  jz 0x1795e3 
001795CA  mov eax,ecx 
001795CC  movups [eax],xmm0 
001795CF  lea eax,[ecx+0x30] 
001795D2  movups [ecx+0x10],xmm0 
001795D6  movups [ecx+0x20],xmm0 
001795DA  add ecx,byte +0x40 
001795DD  movups [eax],xmm0 
001795E0  mov eax,[ebp-0x14] 
001795E3  test al,0x8 
001795E5  jz 0x1795f8 
001795E7  mov eax,ecx 
001795E9  movups [eax],xmm0 
001795EC  lea eax,[ecx+0x10] 
001795EF  add ecx,byte +0x20 
001795F2  movups [eax],xmm0 
001795F5  mov eax,[ebp-0x14] 
001795F8  test al,0x4 
001795FA  jz 0x1795ff 
001795FC  movups [ecx],xmm0 
001795FF  movups [edx+eax*4-0x10],xmm0     <= TAIL

EDIT: OK, now rechecking it, it looks like 3T3 has a bit more instructions doing weird things....

[Updated on: Wed, 20 May 2020 16:01]

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #54005 is a reply to message #54004]

Wed, 20 May 2020 16:15

Tom1
Messages: 1319
Registered: March 2007

Ultimate Contributor

Hi,

This is strange, since I immediately added the inline to 7a when I started testing it. (I found out earlier that MSBT19 did not do it for me.) Now I did a new run and the result is in the attached csv.

Can you post the latest 7a if it is any different compared to the one posted here above?

Best regards,

Tom

Attachment: memset.csv
(Size: 1.95KB, Downloaded 323 times)

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #54006 is a reply to message #54005]

Wed, 20 May 2020 17:16

mirek
Messages: 14291
Registered: November 2005

Ultimate Member

Tom1 wrote on Wed, 20 May 2020 16:15

Hi,

This is strange, since I immediately added the inline to 7a when I started testing it. (I found out earlier that MSBT19 did not do it for me.) Now I did a new run and the result is in the attached csv.

Can you post the latest 7a if it is any different compared to the one posted here above?

Best regards,

Tom

It is now in trunk as memsetd....

Mirek

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #54007 is a reply to message #54006]

Wed, 20 May 2020 17:31

mirek
Messages: 14291
Registered: November 2005

Ultimate Member

I am getting quite different picture:

	int bsize=8*1024*1024;
	Buffer<dword> b(bsize, 0);
	dword cw = 123;

	String result="\"N\",\"memsetd()\",\"Fill3T3()\"\r\n";
	for(int len=1;len<=bsize;){
		int maximum=100000000/len;
		int64 t0=usecs();
		for(int i = 0; i < maximum; i++)
			memsetd(~b, cw, len);
		int64 t1=usecs();
		for(int i = 0; i < maximum; i++)
			Fill3T3(~b, cw, len);
		int64 t2=usecs();
		String r = Format("%d,%f,%f",len,1000.0*(t1-t0)/maximum,1000.0*(t2-t1)/maximum);
		RLOG(r);
		result.Cat(r + "\r\n");
		if(len<64) len++;
		else len*=2;
	}
	
	SaveFile(GetHomeDirFile("memset.csv"),result);

I am starting to wonder if there is difference between our MSC 32bit compilers...

Attachment: memset.csv
(Size: 1.90KB, Downloaded 301 times)

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #54008 is a reply to message #54007]

Wed, 20 May 2020 17:37

mirek
Messages: 14291
Registered: November 2005

Ultimate Member

Ha, funny. It depends on order of functions tested. If I test memsetd second, I am getting different numbers Smile

Smile

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #54010 is a reply to message #54007]

Wed, 20 May 2020 19:51

Tom1
Messages: 1319
Registered: March 2007

Ultimate Contributor

mirek wrote on Wed, 20 May 2020 18:31

I am getting quite different picture:

	int bsize=8*1024*1024;
	Buffer<dword> b(bsize, 0);
	dword cw = 123;

	String result="\"N\",\"memsetd()\",\"Fill3T3()\"\r\n";
	for(int len=1;len<=bsize;){
		int maximum=100000000/len;
		int64 t0=usecs();
		for(int i = 0; i < maximum; i++)
			memsetd(~b, cw, len);
		int64 t1=usecs();
		for(int i = 0; i < maximum; i++)
			Fill3T3(~b, cw, len);
		int64 t2=usecs();
		String r = Format("%d,%f,%f",len,1000.0*(t1-t0)/maximum,1000.0*(t2-t1)/maximum);
		RLOG(r);
		result.Cat(r + "\r\n");
		if(len<64) len++;
		else len*=2;
	}
	
	SaveFile(GetHomeDirFile("memset.csv"),result);

I am starting to wonder if there is difference between our MSC 32bit compilers...

Hi,

No wonder we ended up with (very slightly) different approach... Your results are more or less reversed to what I'm getting. I tried to reorder the calls too, but without any observable difference.

It's either the different CPUs or a different compiler. My compiler is:

Microsoft (R) C/C++ Optimizing Compiler Version 19.21.27702.2 for x86
Copyright (C) Microsoft Corporation.  All rights reserved.

Should I downgrade or upgrade?...

Anyway, seriously I'm pleased with the final result here. The filler is now better than anything before and can be used generally for all clearing/presetting of buffers. I use this a lot in signal processing in addition to clearing the ImageBuffer for BufferPainter. After all, the ImageBuffer needs to be cleared or preset to user preference background color once before each display update. It is much better to have a 1.5 ms delay instead of 3.6 ms delay before drawing approximately 10-20 ms worth of vector map data on the screen. Smile

Smile

Should this new memsetd() now be deployed all over the u++? I mean e.g. Core/Topt.h :: Fill?

Thank you a lot for your great work on this! Smile

Smile

Best regards,

Tom

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #54011 is a reply to message #54010]

Thu, 21 May 2020 09:04

mirek
Messages: 14291
Registered: November 2005

Ultimate Member

[quote title=Tom1 wrote on Wed, 20 May 2020 19:51]mirek wrote on Wed, 20 May 2020 18:31

Should this new memsetd() now be deployed all over the u++? I mean e.g. Core/Topt.h :: Fill?

IDK, maybe as specialisation...

Mirek

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #54014 is a reply to message #54011]

Thu, 21 May 2020 13:28

mirek
Messages: 14291
Registered: November 2005

Ultimate Member

OK, so I could not stop digging and found last important ingredient: alignment matters!

void FillX(void *p, dword data, int len)
{
	dword *t = (dword *)p;
	if(len < 4) {
		if(len & 2) {
			t[0] = t[1] = t[len - 1] = data;
			return;
		}
		if(len & 1)
			t[0] = data;
		return;
	}

	__m128i val4 = _mm_set1_epi32(data);
	auto Set4 = [&](int at) { _mm_storeu_si128((__m128i *)(t + at), val4); };

	Set4(len - 4); // fill tail
	if(len >= 16) {
		Set4(0); // align up on 16 bytes boundary
		const dword *e = t + len;
		t = (dword *)(((uintptr_t)t | 15) + 1);
		len = e - t;
		e -= 16;
		if(len >= 1024*1024) { // for really huge data, bypass the cache
			huge_memsetd(t, data, len);
			return;
		}
		while(t <= e) {
			Set4(0); Set4(4); Set4(8); Set4(12);
			t += 16;
		}
	}
	if(len & 8) {
		Set4(0); Set4(4);
		t += 8;
	}
	if(len & 4)
		Set4(0);
}

This is about twice as fast as Fill7a for len > 60 (up to cache bypass limit).

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #54017 is a reply to message #54003]

Thu, 21 May 2020 16:21

Tom1
Messages: 1319
Registered: March 2007

Ultimate Contributor

mirek wrote on Wed, 20 May 2020 16:18

Tom1 wrote on Wed, 20 May 2020 13:01

EDIT: Let me rephrase it: Is there a way to check during development that an application will never use unaligned memset?

memsetd!

Yes, put ASSERT(((uintptr_t)t & 3) == 0); to memsetd

Mirek

Good point! Please do!

Best regards,

Tom

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #54018 is a reply to message #54014]

Thu, 21 May 2020 16:38

Tom1
Messages: 1319
Registered: March 2007

Ultimate Contributor

Hi,

This new FillX is incredibly elegant! Congratulations Mirek! I really do like your new findings there. You just need to rename it as memsetd() and place in the correct header in Core... Smile

Smile

Best regards,

Tom

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #54021 is a reply to message #54018]

Thu, 21 May 2020 17:51

koldo
Messages: 3458
Registered: August 2008

Senior Veteran

Thank you all for your job. Although please review this in Redmine.

Best regards
Iñaki

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #54022 is a reply to message #54021]

Thu, 21 May 2020 19:22

Tom1
Messages: 1319
Registered: March 2007

Ultimate Contributor

Hi Koldo,

I checked and #include <emmintrin.h> seems to work just fine for what we are working on. Thanks for pointing this out.

Mirek: Agree?

Best regards,

Tom

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #54023 is a reply to message #54022]

Thu, 21 May 2020 19:25

Tom1
Messages: 1319
Registered: March 2007

Ultimate Contributor

Mirek,

I just found that there is a sweet spot at ~0x3f alignment (i.e. 64 bytes) on my CPU. This is presumably the L1 cache line length, if I'm not mistaken.

Best regards,

Tom

EDIT: It just looks that I cannot squeeze the benefit out as re-alignment code tends to eat what would could possibly be achieved here. However, if allocator could allocate large blocks at even 64 byte limits, that could improve performance behind the scenes.

[Updated on: Thu, 21 May 2020 23:46]

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #54026 is a reply to message #54023]

Fri, 22 May 2020 09:32

Didier
Messages: 740
Registered: November 2008
Location: France

Contributor

Hello mirek ans Tom,
Grenat work hère but I have une simple question: what is the point with cache ?
Normally cache speeds things up when you need to reaccess data just After writing it.
So filling a buffer with a constant value that is not read immediatly After in most cases isn't a corresponding use case.
So, I think that having a fill function that doesn't use cache at all will benefit in two points:
Timing stability and more importantly, cache is not touched so it can speed up other functions calls further

Report message to a moderator

Pages (3): [ « ‹ 1 2 3 › »]

Previous Topic:	Should we still care about big-endian CPUs?
Next Topic:	TheIDE crash after switching package

Goto Forum:

-=] Back to Top [=-

[ Syndicate this forum (XML) ] [

] [

PDF

]

Current Time: Wed May 13 01:34:25 GMT+2 2026

Total time taken to generate the page: 0.01356 seconds