U++ framework: U++ Developers corner » SSE2 and SVO optimization (Painter, memcpy....)

Home » Developing U++ » U++ Developers corner » SSE2 and SVO optimization (Painter, memcpy....)

Show: Today's Messages :: Polls :: Message Navigator
E-mail to friend

SSE2 and SVO optimization (Painter, memcpy....) [message #53751]

Mon, 27 April 2020 19:19

Tom1
Messages: 1319
Registered: March 2007

Ultimate Contributor

Hi,

Here's an optimization for BufferPainter.

BufferPainter::Clear(RGBA) speed is improved by over 30 % with the following change in Painter/Render.cpp:

void BufferPainter::ClearOp(const RGBA& color)
{
//	UPP::Fill(~*ip, color, ip->GetLength());
	FillRGBA(~*ip, color, ip->GetLength());
	ip->SetKind(color.a == 255 ? IMAGE_OPAQUE : IMAGE_ALPHA);
}

And in Painter/Fillers.h:

namespace Upp {

// Add the following line:
#define FillRGBA(a,b,c) memsetd((a),*(dword*)&(b),(c)) 

struct SolidFiller : Rasterizer::Filler {

This may be significant in some usage scenarios as it can currently take e.g. 4.5 milliseconds to clear a 4K ImageBuffer before drawing to it. This can now be reduced to 2.8 milliseconds.

Best regards,

Tom

EDIT: Changed code to use the newly optimized FillRGBA() found in Fillers.h. This can be found at:
https://www.ultimatepp.org/forums/index.php?t=msg&th=110 11&goto=53752&#msg_53752

[Updated on: Sun, 24 May 2020 10:22] by Moderator

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53757 is a reply to message #53751]

Tue, 28 April 2020 10:12

mirek
Messages: 14291
Registered: November 2005

Ultimate Member

Tom1 wrote on Mon, 27 April 2020 19:19

Hi,

Here's an optimization for BufferPainter.

BufferPainter::Clear(RGBA) speed is improved by over 30 % with the following change in Painter/Render.cpp:
void BufferPainter::ClearOp(const RGBA& color)
{
//	UPP::Fill(~*ip, color, ip->GetLength());
	FillRGBA(~*ip, color, ip->GetLength());
	ip->SetKind(color.a == 255 ? IMAGE_OPAQUE : IMAGE_ALPHA);
}
And in Painter/Fillers.h:
namespace Upp {

// Add the following line:
#define FillRGBA(a,b,c) memsetd((a),*(dword*)&(b),(c)) 

struct SolidFiller : Rasterizer::Filler {
This may be significant in some usage scenarios as it can currently take e.g. 4.5 milliseconds to clear a 4K ImageBuffer before drawing to it. This can now be reduced to 2.8 milliseconds.

Now this is really interesting. Fill for RGBA* is actually one that is optimized for filling huge blocks. I will need to do some benchmarks...

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53758 is a reply to message #53757]

Tue, 28 April 2020 10:20

mirek
Messages: 14291
Registered: November 2005

Ultimate Member

Current Fill(RGBA * assembler code

4000EEE0  cmp r8d,byte +0x10 
4000EEE4  jl 0x14000ef13 
4000EEE6  movd xmm0,edx 
4000EEEA  pshufd xmm0,xmm0,0x0 
4000EEEF  nop 
4000EEF0  mov eax,r8d 
4000EEF3  movdqu [rcx],xmm0 
4000EEF7  movdqu [rcx+0x10],xmm0 
4000EEFC  movdqu [rcx+0x20],xmm0 
4000EF01  movdqu [rcx+0x30],xmm0 
4000EF06  add rcx,byte +0x40 
4000EF0A  lea r8d,[rax-0x10] 
4000EF0E  cmp eax,byte +0x1f 
4000EF11  jg 0x14000eef0 
4000EF13  add r8d,byte -0x1 
4000EF17  cmp r8d,byte +0xe 
4000EF1B  ja 0x14000ef59 
4000EF1D  lea r9,[rel 0x4000ef5c] 
4000EF24  movsxd rax,dword [r9+r8*4] 
4000EF28  add rax,r9 
4000EF2B  jmp rax 
4000EF2D  mov [rcx+0x38],edx 
4000EF30  mov [rcx+0x34],edx 
4000EF33  mov [rcx+0x30],edx 
4000EF36  mov [rcx+0x2c],edx 
4000EF39  mov [rcx+0x28],edx 
4000EF3C  mov [rcx+0x24],edx 
4000EF3F  mov [rcx+0x20],edx 
4000EF42  mov [rcx+0x1c],edx 
4000EF45  mov [rcx+0x18],edx 
4000EF48  mov [rcx+0x14],edx 
4000EF4B  mov [rcx+0x10],edx 
4000EF4E  mov [rcx+0xc],edx 
4000EF51  mov [rcx+0x8],edx 
4000EF54  mov [rcx+0x4],edx 
4000EF57  mov [rcx],edx 
4000EF59  ret

and the central snippet from the memsetd variant....

40001565  movaps xmm0,[rel 0x402c60a0] 
4000156C  nop dword [rax+0x0] 
40001570  movups [rsi+rdx*4],xmm0 
40001574  movups [rsi+rdx*4+0x10],xmm0 
40001579  movups [rsi+rdx*4+0x20],xmm0 
4000157E  movups [rsi+rdx*4+0x30],xmm0 
40001583  movups [rsi+rdx*4+0x40],xmm0 
40001588  movups [rsi+rdx*4+0x50],xmm0 
4000158D  movups [rsi+rdx*4+0x60],xmm0 
40001592  movups [rsi+rdx*4+0x70],xmm0 
40001597  movups [rsi+rdx*4+0x80],xmm0 
4000159F  movups [rsi+rdx*4+0x90],xmm0 
400015A7  movups [rsi+rdx*4+0xa0],xmm0 
400015AF  movups [rsi+rdx*4+0xb0],xmm0 
400015B7  movups [rsi+rdx*4+0xc0],xmm0 
400015BF  movups [rsi+rdx*4+0xd0],xmm0 
400015C7  movups [rsi+rdx*4+0xe0],xmm0 
400015CF  movups [rsi+rdx*4+0xf0],xmm0 
400015D7  add rdx,byte +0x40 
400015DB  add rdi,byte +0x8 
400015DF  jnz 0x140001570

Interesting...

Benchmarking code

#include <CtrlLib/CtrlLib.h>

using namespace Upp;

GUI_APP_MAIN
{
	Color c = Red();
	
	int len = 4000 * 2000;
	
	Buffer<RGBA> b(len);

	for(int i = 0; i < 1000; i++) {
		{
			RTIMING("memsetd");
			memsetd(b, *(dword*)&(c), len);
		}
		{
			RTIMING("Fill");
			Fill(b, c, len);
		}
	}
}

CLANGx64, 2700x

TIMING Fill : 2.73 s - 2.73 ms ( 2.73 s / 1000 ), min: 2.00 ms, max: 4.00 ms, nesting: 0 - 1000
TIMING memsetd : 2.78 s - 2.78 ms ( 2.78 s / 1000 ), min: 2.00 ms, max: 5.00 ms, nesting: 0 - 1000

MSBT19x64

TIMING Fill : 2.89 s - 2.89 ms ( 2.89 s / 1000 ), min: 2.00 ms, max: 5.00 ms, nesting: 0 - 1000
TIMING memsetd : 2.90 s - 2.90 ms ( 2.90 s / 1000 ), min: 2.00 ms, max: 5.00 ms, nesting: 0 - 1000

[Updated on: Tue, 28 April 2020 10:31]

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53760 is a reply to message #53757]

Tue, 28 April 2020 10:27

Tom1
Messages: 1319
Registered: March 2007

Ultimate Contributor

Hi,

Benchmarking and tuning is exactly what I did through yesterday (and beyond). I worked with both CLANGx64 and MSBT19x64. I worked out a bunch of optimized fillers until it turned out that memsetd() wins easily on large blocks and mostly on smaller blocks too. Especially on MSBT19x64 there does not seem to be a way to beat memsetd(). On CLANGx64 small transfer of one or two items was slightly faster, but on larger blocks memsetd() won again. Interestingly, CLANGx64 was a lot faster than MSBT19x64 for any of my own block transfer attempts, but still could not beat memsetd().

Best regards,

Tom

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53761 is a reply to message #53760]

Tue, 28 April 2020 10:33

mirek
Messages: 14291
Registered: November 2005

Ultimate Member

I guess it might be CPU related... ?

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53763 is a reply to message #53761]

Tue, 28 April 2020 11:10

mirek
Messages: 14291
Registered: November 2005

Ultimate Member

Hm, MacOS 2,3 GHz Intel Core i5

TIMING Fill : 1.52 s - 1.52 ms ( 1.52 s / 1000 ), min: 1.00 ms, max: 2.00 ms, nesting: 0 - 1000
TIMING memsetd : 1.53 s - 1.53 ms ( 1.53 s / 1000 ), min: 1.00 ms, max: 12.00 ms, nesting: 0 - 1000

That's quite weird...

Mirek

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53764 is a reply to message #53761]

Tue, 28 April 2020 11:17

Tom1
Messages: 1319
Registered: March 2007

Ultimate Contributor

Hi,

Yes, CPU is likely the major player here. I took the liberty to modify TimingInspector to get finer granularity for timing using usecs(). The modified testcase can be found below. I get the following results on my Core i7 + Windows 10 professional x64. Now we can focus on the best round 'min:' to better avoid other tasks' effect. As you can see memsetd on MSBT19x64 is quite amazing performer.

MSBT19x64, Intel Core i7:

TIMING memsetd : 1.45 s - 1.45 ms ( 1.45 s / 1000 ), min: 1.15 ms, max: 5.25 ms, nesting: 0 - 1000
TIMING Fill : 3.73 s - 3.73 ms ( 3.73 s / 1000 ), min: 3.25 ms, max: 9.92 ms, nesting: 0 - 1000

CLANGx64, Intel Core i7:

TIMING memsetd : 3.85 s - 3.85 ms ( 3.85 s / 1000 ), min: 3.35 ms, max: 10.36 ms, nesting: 0 - 1000
TIMING Fill : 3.87 s - 3.87 ms ( 3.87 s / 1000 ), min: 3.38 ms, max: 11.33 ms, nesting: 0 - 1000

I guess that in my larger program the optimizations did not work this well as the Fill would have performed at around 5 ms level for this size of a buffer.

Anyway here's the modified benchmark.

#include <CtrlLib/CtrlLib.h>

using namespace Upp;

class UTimingInspector {
protected:
	static bool active;

	const char *name;
	int         call_count;
	int64       total_time;
	int64       min_time;
	int64       max_time;
	int         max_nesting;
	int         all_count;
	StaticMutex mutex;

public:
	UTimingInspector(const char *name = NULL); // Not String !!!
	~UTimingInspector();

	void   Add(dword time, int nesting);

	String Dump();

	class Routine {
	public:
		Routine(UTimingInspector& stat, int& nesting)
		: nesting(nesting), stat(stat) {
			start_time = usecs();
			nesting++;
		}

		~Routine() {
			nesting--;
			stat.Add(start_time, nesting);
		}

	protected:
		int64 start_time;
		int& nesting;
		UTimingInspector& stat;
	};

	static void Activate(bool b)                    { active = b; }
};

bool UTimingInspector::active = true;

static UTimingInspector s_zero; // time of Start / End without actual body to measure

UTimingInspector::UTimingInspector(const char *_name) {
	name = _name ? _name : "";
	all_count = call_count = max_nesting = min_time = max_time = total_time = 0;
	static bool init;
	if(!init) {
#if defined(PLATFORM_WIN32) && !defined(PLATFORM_WINCE)
		timeBeginPeriod(1);
#endif
		init = true;
	}
}

UTimingInspector::~UTimingInspector() {
	if(this == &s_zero) return;
	Mutex::Lock __(mutex);
	StdLog() << Dump() << "\r\n";
}

void UTimingInspector::Add(dword time, int nesting)
{
	time = usecs() - time;
	Mutex::Lock __(mutex);
	if(!active) return;
	all_count++;
	if(nesting > max_nesting)
		max_nesting = nesting;
	if(nesting == 0) {
		total_time += time;
		if(call_count++ == 0)
			min_time = max_time = time;
		else {
			if(time < min_time)
				min_time = time;
			if(time > max_time)
				max_time = time;
		}
	}
}

String UTimingInspector::Dump() {
	Mutex::Lock __(mutex);
	String s = Sprintf("TIMING %-15s: ", name);
	if(call_count == 0)
		return s + "No active hit";
	ONCELOCK {
		int w = GetTickCount();
		while(GetTickCount() - w < 200) { // measure profiling overhead
			thread_local int nesting = 0;
			UTimingInspector::Routine __(s_zero, nesting);
		}
	}
	double tm = max(0.0, double(total_time) / call_count / 1000000 -
			             double(s_zero.total_time) / s_zero.call_count / 1000000);
	return s
	       + timeFormat(tm * call_count)
	       + " - " + timeFormat(tm)
	       + " (" + timeFormat((double)total_time  / 1000000) + " / "
	       + Sprintf("%d )", call_count)
		   + ", min: " + timeFormat((double)min_time / 1000000)
		   + ", max: " + timeFormat((double)max_time / 1000000)
		   + Sprintf(", nesting: %d - %d", max_nesting, all_count);
}

#define RUTIMING(x) \
	static UTimingInspector COMBINE(sTmStat, __LINE__)(x); \
	static thread_local int COMBINE(sTmStatNesting, __LINE__); \
	UTimingInspector::Routine COMBINE(sTmStatR, __LINE__)(COMBINE(sTmStat, __LINE__), COMBINE(sTmStatNesting, __LINE__))

GUI_APP_MAIN
{
	Color c = Red();
	
	int len = 4000 * 2000;
	
	Buffer<RGBA> b(len);

	for(int i = 0; i < 1000; i++) {
		{
			RUTIMING("Fill");
			Fill(b, c, len);
		}
		{
			RUTIMING("memsetd");
			memsetd(b, *(dword*)&(c), len);
		}
	}
}

Best regards,

Tom

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53765 is a reply to message #53763]

Tue, 28 April 2020 11:27

Oblivion
Messages: 1266
Registered: August 2007

Senior Contributor

Hello,

A quick test on an older AMD FX 6100, six core processor. 3.2 GHZ (naturally, it is slower):

// GCC (x64, latest ver.)
TIMING Fill           :  7,53 s  -  7,53 ms ( 7,53 s  / 1000 ), min:  7,00 ms, max:  9,00 ms, nesting: 0 - 1000
TIMING memsetd        :  6,31 s  -  6,31 ms ( 6,31 s  / 1000 ), min:  6,00 ms, max: 18,00 ms, nesting: 0 - 1000
                                                                                     ----
// CLANG(x64, latest ver.)
 TIMING Fill           :  7,07 s  -  7,07 ms ( 7,07 s  / 1000 ), min:  6,00 ms, max: 10,00 ms, nesting: 0 - 1000
 TIMING memsetd        :  7,07 s  -  7,07 ms ( 7,08 s  / 1000 ), min:  6,00 ms, max: 17,00 ms, nesting: 0 - 1000
                                                                                     -----

Best regards,
Oblivion

Github page: https://github.com/ismail-yilmaz
Bobcat the terminal emulator: https://github.com/ismail-yilmaz/Bobcat

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53913 is a reply to message #53751]

Fri, 15 May 2020 09:04

mirek
Messages: 14291
Registered: November 2005

Ultimate Member

Experimenting with parallel:

#include <CtrlLib/CtrlLib.h>

using namespace Upp;

void CoFill(RGBA *t, RGBA c, int len)
{
	const int CHUNK = 1024;
	std::atomic<int> ii(0);
	CoDo([&] {
		for(;;) {
			int pos = CHUNK * ii++;
			if(pos >= len)
				break;
			Fill(t + pos, c, min(CHUNK, len - pos));
		}
	});
}

GUI_APP_MAIN
{
	Color c = Red();
	
	int len = 4000 * 2000;
	
	Buffer<RGBA> b(len);

	for(int i = 0; i < 10; i++) {
		{
			RTIMING("memsetd");
			memsetd(b, *(dword*)&(c), len);
		}
		{
			RTIMING("Fill");
			Fill(b, c, len);
		}
		{
			RTIMING("CoFill");
			CoFill(b, c, len);
		}
	}
}

TIMING CoFill         : 19.00 ms -  1.90 ms (19.00 ms / 10 ), min:  1.00 ms, max:  3.00 ms, nesting: 0 - 10
TIMING Fill           : 31.00 ms -  3.10 ms (31.00 ms / 10 ), min:  3.00 ms, max:  4.00 ms, nesting: 0 - 10
TIMING memsetd        : 30.00 ms -  3.00 ms (30.00 ms / 10 ), min:  2.00 ms, max:  5.00 ms, nesting: 0 - 10

To try that on different CPU, Rapsberry PI 4 numbers:

TIMING CoFill         : 145.00 ms - 14.50 ms (145.00 ms / 10 ), min: 14.00 ms, max: 15.00 ms, nesting: 0 - 10
TIMING Fill           : 225.00 ms - 22.50 ms (225.00 ms / 10 ), min: 22.00 ms, max: 24.00 ms, nesting: 0 - 10
TIMING memsetd        : 184.00 ms - 18.40 ms (184.00 ms / 10 ), min: 11.00 ms, max: 77.00 ms, nesting: 0 - 10

[Updated on: Fri, 15 May 2020 10:18]

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53914 is a reply to message #53913]

Fri, 15 May 2020 10:18

Tom1
Messages: 1319
Registered: March 2007

Ultimate Contributor

Hi Mirek,

While interesting, I found that a plain memset() is way faster than memsetd() or Fill(). Just filling with 0xff (as the RGBA is for white) you will get a superior speed. I currently use memset() for a clear white on a ImageBuffer before giving it to BufferPainter. For more complex fill colors, I guess, the apex_memmove / memcpy code could be investigated for a more optimal result. (I posted a link to the apex code here on the forum briefly before release of 2020.1 Smile

Smile

Best regards,

Tom

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53915 is a reply to message #53914]

Fri, 15 May 2020 11:33

mirek
Messages: 14291
Registered: November 2005

Ultimate Member

Tom1 wrote on Fri, 15 May 2020 10:18

While interesting, I found that a plain memset() is way faster than memsetd() or Fill(). Just filling with 0xff (as the RGBA is for white) you will get a superior speed. I currently use memset() for a clear white on a ImageBuffer before giving it to BufferPainter. For more complex fill colors, I guess, the apex_memmove / memcpy code could be investigated for a more optimal result. (I posted a link to the apex code here on the forum briefly before release of 2020.1

Best regards,

Tom

With CLANG, memset performance is about the same. However, with MSVC, it really is pretty damn fast.

I have digged into the code and the key ingredient seems to be MOVNTPS instruction, which means the code could be easily adapted to setting dwords. I just need to understand MT implications mentioned here:

https://www.felixcloutier.com/x86/movntps

It also might be questionable how this will affect the performance down the road (data not being in cache and everything...)

Mirek

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53916 is a reply to message #53915]

Fri, 15 May 2020 11:41

Tom1
Messages: 1319
Registered: March 2007

Ultimate Contributor

At the time I was testing with the memset -- if I remember correctly -- on Windows + CLANG the memset with zero value was very efficient too, but the rest of the set values were slower. So, there must be some special optimized implementation for zeroing memory on CLANG too.

BR, Tom

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53917 is a reply to message #53915]

Fri, 15 May 2020 11:47

mirek
Messages: 14291
Registered: November 2005

Ultimate Member

Here we go:

void SSEFill2(RGBA *t, RGBA c, int len)
{
	if(len >= 512) {
		while((uintptr_t)t & 63) { // align to cache line
			*t++ = c;
			len--;
		}
		dword m[4];
		m[0] = m[1] = m[2] = m[3] = *(dword*)&(c);
		__m128d val = _mm_loadu_pd((double *)m);
		while(len >= 16) {
			_mm_stream_pd((double *)t, val);
			_mm_stream_pd((double *)(t + 4), val);
			_mm_stream_pd((double *)(t + 8), val);
			_mm_stream_pd((double *)(t + 12), val);
			t += 16;
			len -= 16;
		}
		_mm_sfence();
	}

	Fill(t, c, len);
}

TIMING CoFill         : 42.00 ms -  2.10 ms (42.00 ms / 20 ), min:  1.00 ms, max:  3.00 ms, nesting: 0 - 20
TIMING SSEFill2       : 16.00 ms - 799.98 us (16.00 ms / 20 ), min:  0.00 ns, max:  1.00 ms, nesting: 0 - 20
TIMING SSEFill        : 55.00 ms -  2.75 ms (55.00 ms / 20 ), min:  2.00 ms, max:  3.00 ms, nesting: 0 - 20
TIMING Fill           : 56.00 ms -  2.80 ms (56.00 ms / 20 ), min:  2.00 ms, max:  3.00 ms, nesting: 0 - 20
TIMING memsetd        : 52.00 ms -  2.60 ms (52.00 ms / 20 ), min:  2.00 ms, max:  3.00 ms, nesting: 0 - 20

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53918 is a reply to message #53917]

Fri, 15 May 2020 12:08

Tom1
Messages: 1319
Registered: March 2007

Ultimate Contributor

Laughing

And we have a winner!!

Also, please take a look at MSBT19 and MSBT19x64 for this too. It looks like this code only works with CLANG and CLANGx64 on Windows. (Have not checked on Linux yet.)
Additionally, plain memset, memsets and memsetd -variants would be useful for various tasks, as their efficiency varies depending on the compiler.

Thanks and best regards,

Tom

EDIT: I mean it does not compile on MSBT...

[Updated on: Fri, 15 May 2020 12:09]

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53919 is a reply to message #53751]

Fri, 15 May 2020 12:16

Oblivion
Messages: 1266
Registered: August 2007

Senior Contributor

On linux with the relatively old AMD Athlon FX 6100.

Works with both GCC (9.3) and CLANG (10.0). Requires #include <smmintrin.h>:

GCC:

TIMING SSEFill2       : 43,99 ms -  4,40 ms (44,00 ms / 10 ), min:  4,00 ms, max:  5,00 ms, nesting: 0 - 10
TIMING CoFill         : 55,99 ms -  5,60 ms (56,00 ms / 10 ), min:  5,00 ms, max:  6,00 ms, nesting: 0 - 10
TIMING Fill           : 75,99 ms -  7,60 ms (76,00 ms / 10 ), min:  7,00 ms, max:  8,00 ms, nesting: 0 - 10
TIMING memsetd        : 66,99 ms -  6,70 ms (67,00 ms / 10 ), min:  5,00 ms, max: 17,00 ms, nesting: 0 - 10

CLANG:

TIMING SSEFill2       : 45,99 ms -  4,60 ms (46,00 ms / 10 ), min:  4,00 ms, max:  7,00 ms, nesting: 0 - 10
TIMING CoFill         : 55,99 ms -  5,60 ms (56,00 ms / 10 ), min:  5,00 ms, max:  6,00 ms, nesting: 0 - 10
TIMING Fill           : 65,99 ms -  6,60 ms (66,00 ms / 10 ), min:  6,00 ms, max: 10,00 ms, nesting: 0 - 10
TIMING memsetd        : 78,99 ms -  7,90 ms (79,00 ms / 10 ), min:  5,00 ms, max: 23,00 ms, nesting: 0 - 10

Github page: https://github.com/ismail-yilmaz
Bobcat the terminal emulator: https://github.com/ismail-yilmaz/Bobcat

[Updated on: Fri, 15 May 2020 12:27]

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53920 is a reply to message #53919]

Fri, 15 May 2020 12:28

Tom1
Messages: 1319
Registered: March 2007

Ultimate Contributor

Hi,

Thanks Oblivion; the #include <smmintrin.h> was exactly what was needed on Windows + CLANG too...

Here are the results for the 4k RGBA fill on Windows 10 x64 on Core i9:

MSBT19:
TIMING SSEFill2       :  1.30 s  -  1.30 ms ( 1.30 s  / 1000 ), min:  1.03 ms, max:  1.99 ms, nesting: 0 - 1000
TIMING Fill           :  1.13 s  -  1.13 ms ( 1.13 s  / 1000 ), min: 841.00 us, max:  3.04 ms, nesting: 0 - 1000

MSBT19x64:
TIMING SSEFill2       : 906.90 ms - 906.90 us (906.93 ms / 1000 ), min: 846.00 us, max:  1.67 ms, nesting: 0 - 1000
TIMING Fill           :  2.34 s  -  2.34 ms ( 2.34 s  / 1000 ), min:  2.21 ms, max:  4.69 ms, nesting: 0 - 1000

CLANG:
TIMING SSEFill2       : 935.97 ms - 935.97 us (936.02 ms / 1000 ), min: 854.00 us, max:  1.67 ms, nesting: 0 - 1000
TIMING Fill           :  2.44 s  -  2.44 ms ( 2.44 s  / 1000 ), min:  2.25 ms, max:  4.74 ms, nesting: 0 - 1000

CLANGx64:
TIMING SSEFill2       : 934.45 ms - 934.45 us (934.47 ms / 1000 ), min: 854.00 us, max:  1.77 ms, nesting: 0 - 1000
TIMING Fill           :  2.20 s  -  2.20 ms ( 2.20 s  / 1000 ), min:  1.98 ms, max:  5.97 ms, nesting: 0 - 1000

Looks very good indeed! MSBT19 on the other hand looks surprising...

Best regards,

Tom

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53922 is a reply to message #53918]

Fri, 15 May 2020 13:15

mirek
Messages: 14291
Registered: November 2005

Ultimate Member

Tom1 wrote on Fri, 15 May 2020 12:08

Additionally, plain memset, memsets and memsetd -variants would be useful for various tasks, as their efficiency varies depending on the compiler.

What about this:

void FillCacheLines(void *cache_aligned_ptr, void *data16, int count)
{
	dword *t = (dword *)cache_aligned_ptr;
	__m128d val = _mm_loadu_pd((double *)data16);
	dword *e = t + 16 * count;
	while(t < e) {
		_mm_stream_pd((double *)t, val);
		_mm_stream_pd((double *)(t + 4), val);
		_mm_stream_pd((double *)(t + 8), val);
		_mm_stream_pd((double *)(t + 12), val);
		t += 16;
	}
	_mm_sfence();
}

template <class T>
void MemSet(void *dest, T data, int len)
{
	static_assert(sizeof(T) == 1 || sizeof(T) == 2 || sizeof(T) == 4 || sizeof(T) == 8 || sizeof(T) == 16, "invalid sizeof");
	T *t = (T *)dest;
	if(len * sizeof(T) > 550) {
		while((uintptr_t)t & 63) { // align to cache line
			*t++ = data;
			len--;
		}
		const int itemn = 16 / sizeof(T);
		const int per_cache_line = 4 * itemn;
		T m[itemn];
		for(int i = 0; i < itemn; i++)
			m[i] = data;
		int count = len / per_cache_line;
		FillCacheLines(t, m, count);
		len -= per_cache_line * count;
	}
	
	while(len >= 16) {
		t[0] = data; t[1] = data; t[2] = data; t[3] = data;
		t[4] = data; t[5] = data; t[6] = data; t[7] = data;
		t[8] = data; t[9] = data; t[10] = data; t[11] = data;
		t[12] = data; t[13] = data; t[14] = data; t[15] = data;
		t += 16;
		len -= 16;
	}
	switch(len) {
	case 15: t[14] = data;
	case 14: t[13] = data;
	case 13: t[12] = data;
	case 12: t[11] = data;
	case 11: t[10] = data;
	case 10: t[9] = data;
	case 9: t[8] = data;
	case 8: t[7] = data;
	case 7: t[6] = data;
	case 6: t[5] = data;
	case 5: t[4] = data;
	case 4: t[3] = data;
	case 3: t[2] = data;
	case 2: t[1] = data;
	case 1: t[0] = data;
	}
}

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53923 is a reply to message #53922]

Fri, 15 May 2020 13:36

Tom1
Messages: 1319
Registered: March 2007

Ultimate Contributor

Mirek,

Yes, absolutely beautiful!

The results for the set including the new MemSet() on Win10x64 on Core i9 are:

MSBT19:

TIMING MemSet         : 831.06 ms - 831.06 us (831.13 ms / 1000 ), min: 779.00 us, max:  1.72 ms, nesting: 0 - 1000
TIMING SSEFill2       :  1.21 s  -  1.21 ms ( 1.21 s  / 1000 ), min:  1.00 ms, max:  2.19 ms, nesting: 0 - 1000
TIMING Fill           : 915.70 ms - 915.70 us (915.76 ms / 1000 ), min: 859.00 us, max:  3.49 ms, nesting: 0 - 1000

MSBT19x64:

TIMING MemSet         : 818.33 ms - 818.33 us (818.36 ms / 1000 ), min: 777.00 us, max:  1.71 ms, nesting: 0 - 1000
TIMING SSEFill2       : 899.74 ms - 899.74 us (899.77 ms / 1000 ), min: 854.00 us, max:  1.78 ms, nesting: 0 - 1000
TIMING Fill           :  2.29 s  -  2.29 ms ( 2.29 s  / 1000 ), min:  2.21 ms, max:  4.51 ms, nesting: 0 - 1000

CLANG:

TIMING MemSet         : 835.39 ms - 835.39 us (835.45 ms / 1000 ), min: 790.00 us, max:  1.51 ms, nesting: 0 - 1000
TIMING SSEFill2       : 918.63 ms - 918.63 us (918.68 ms / 1000 ), min: 872.00 us, max:  1.47 ms, nesting: 0 - 1000
TIMING Fill           :  2.36 s  -  2.36 ms ( 2.36 s  / 1000 ), min:  2.28 ms, max:  5.45 ms, nesting: 0 - 1000

CLANGx64:

TIMING MemSet         : 838.86 ms - 838.86 us (838.89 ms / 1000 ), min: 787.00 us, max:  1.70 ms, nesting: 0 - 1000
TIMING SSEFill2       : 921.49 ms - 921.49 us (921.51 ms / 1000 ), min: 870.00 us, max:  1.84 ms, nesting: 0 - 1000
TIMING Fill           :  2.10 s  -  2.10 ms ( 2.10 s  / 1000 ), min:  2.01 ms, max:  5.00 ms, nesting: 0 - 1000

I trust you can now make all the different fillers through U++ to use this new code... right?

Thanks and best regards,

Tom

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53934 is a reply to message #53923]

Fri, 15 May 2020 23:13

Tom1
Messages: 1319
Registered: March 2007

Ultimate Contributor

Hi Mirek,

The game is not over yet, I'm afraid. I did some additional benchmarking with varying buffer lengths to set. It get's more complicated...

	RGBA c = Red();
	
	int bsize=8*1024*1024;
	Buffer<RGBA> b(bsize,(RGBA)Blue());

	String result="\"N\",\"Fill()\",\"memsetd()\",\"MemSet()\"\r\n";
	for(int len=1;len<=bsize;len*=2){
		int maximum=1000000000/len;
		int64 t0=usecs();
		for(int i = 0; i < maximum; i++) Fill(~b, c, len);
		int64 t1=usecs(t0);
		t0=usecs();
		for(int i = 0; i < maximum; i++) memsetd(~b, *(dword*)&(c), len);
		int64 t2=usecs(t0);
		t0=usecs();
		for(int i = 0; i < maximum; i++) MemSet(~b, c, len);
		int64 t3=usecs(t0);
		result.Cat(Format("%d,%f,%f,%f\r\n",len,1000.0*t1/maximum,1000.0*t2/maximum,1000.0*t3/maximum));
	}
	
	SaveFile(GetHomeDirFile("Desktop/memset.csv"),result);

Now, if you import the resulting memset.csv to your spreadsheet program and create a log-log plot, you will see that the different buffer lengths have a huge impact on the performance of each algorithm. As filling lengths can be quite diverse, I think we need to think about some combination of the different algorithms. Additionally, we need to look at the results on different CPUs. I will keep tinkering on this one for a while here.

(Now I'm running on Core i7 here at home, so this one I can test easily, and also the Core i9 at the office every now and then, as the situation is what it is...)

Best regards,

Tom

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53935 is a reply to message #53923]

Fri, 15 May 2020 23:45

Didier
Messages: 740
Registered: November 2008
Location: France

Contributor

Here is what I get on my Linux and Ryzen 2700
Du to unstable results with 10 loops, I also placed results for 1000 loops

The new MemSet() is definitly really a good addition Smile

Smile

==== CLANG X64 ====
TIMING MemSet : 10.00 ms - 999.98 us (10.00 ms / 10 ), min: 1.00 ms, max: 1.00 ms, nesting: 0 - 10
TIMING SSEFill2 : 12.00 ms - 1.20 ms (12.00 ms / 10 ), min: 1.00 ms, max: 2.00 ms, nesting: 0 - 10
TIMING CoFill : 21.00 ms - 2.10 ms (21.00 ms / 10 ), min: 2.00 ms, max: 3.00 ms, nesting: 0 - 10
TIMING Fill : 30.00 ms - 3.00 ms (30.00 ms / 10 ), min: 3.00 ms, max: 3.00 ms, nesting: 0 - 10
TIMING memsetd : 30.00 ms - 3.00 ms (30.00 ms / 10 ), min: 2.00 ms, max: 9.00 ms, nesting: 0 - 10

TIMING MemSet : 833.97 ms - 833.97 us (834.00 ms / 1000 ), min: 0.00 ns, max: 2.00 ms, nesting: 0 - 1000
TIMING SSEFill2 : 870.97 ms - 870.97 us (871.00 ms / 1000 ), min: 0.00 ns, max: 2.00 ms, nesting: 0 - 1000
TIMING CoFill : 1.88 s - 1.88 ms ( 1.88 s / 1000 ), min: 1.00 ms, max: 3.00 ms, nesting: 0 - 1000
TIMING Fill : 2.90 s - 2.90 ms ( 2.90 s / 1000 ), min: 2.00 ms, max: 4.00 ms, nesting: 0 - 1000
TIMING memsetd : 2.51 s - 2.51 ms ( 2.51 s / 1000 ), min: 2.00 ms, max: 10.00 ms, nesting: 0 - 1000

==== GCC X64 ====
TIMING MemSet : 7.00 ms - 699.98 us ( 7.00 ms / 10 ), min: 0.00 ns, max: 1.00 ms, nesting: 0 - 10
TIMING SSEFill2 : 9.00 ms - 899.98 us ( 9.00 ms / 10 ), min: 0.00 ns, max: 1.00 ms, nesting: 0 - 10
TIMING CoFill : 23.00 ms - 2.30 ms (23.00 ms / 10 ), min: 2.00 ms, max: 3.00 ms, nesting: 0 - 10
TIMING Fill : 30.00 ms - 3.00 ms (30.00 ms / 10 ), min: 2.00 ms, max: 4.00 ms, nesting: 0 - 10
TIMING memsetd : 35.00 ms - 3.50 ms (35.00 ms / 10 ), min: 2.00 ms, max: 10.00 ms, nesting: 0 - 10

TIMING MemSet : 820.98 ms - 820.98 us (821.00 ms / 1000 ), min: 0.00 ns, max: 2.00 ms, nesting: 0 - 1000
TIMING SSEFill2 : 877.98 ms - 877.98 us (878.00 ms / 1000 ), min: 0.00 ns, max: 2.00 ms, nesting: 0 - 1000
TIMING CoFill : 1.85 s - 1.85 ms ( 1.85 s / 1000 ), min: 1.00 ms, max: 3.00 ms, nesting: 0 - 1000
TIMING Fill : 2.97 s - 2.97 ms ( 2.97 s / 1000 ), min: 2.00 ms, max: 4.00 ms, nesting: 0 - 1000
TIMING memsetd : 2.52 s - 2.52 ms ( 2.52 s / 1000 ), min: 2.00 ms, max: 8.00 ms, nesting: 0 - 1000

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53936 is a reply to message #53935]

Sat, 16 May 2020 01:59

Tom1
Messages: 1319
Registered: March 2007

Ultimate Contributor

Hi,

I've worked on optimizing the new_memsetd() operation through various buffer sizes up to 8M and here's the best I can come up with (at least this night...). With CLANG it seems to be beneficial to use the Mirek's new MemSet() for buffer sizes above about 1M, but below that and also with MSBT19 / MSBT19x64 the result is better without. (This algorithm is especially efficient with small fills and therefore should work well as a BufferPainter filler too.) For best results, there are separate versions for 32-bit and 64-bit code. (The '#ifdef WIN64' obviously only works on Windows, but I think there was some other flag on Linux for detecting a 64-bit environment. Please apply that flag, whatever it is, if you test on Linux, etc...)

#ifdef WIN64

inline void new_memsetd(dword *t, dword data, int len){
#ifdef COMPILER_CLANG
	if(len>1024*1024){
		MemSet(t,data,len);
		return;
	}
#endif
	if(len&1) *t++=data;
	len>>=1;

	uint64 *w=(uint64*)t;
	uint64 q=data;
	q = (q << 32) | data;
	
	switch(len) {
		default:{
			uint64 *lim = w + len;
			while(w < lim) *w++ = q;
			break;
		}
		case 16: w[15] = q;
		case 15: w[14] = q;
		case 14: w[13] = q;
		case 13: w[12] = q;
		case 12: w[11] = q;
		case 11: w[10] = q;
		case 10: w[9] = q;
		case 9: w[8] = q;
		case 8: w[7] = q;
		case 7: w[6] = q;
		case 6: w[5] = q;
		case 5: w[4] = q;
		case 4: w[3] = q;
		case 3: w[2] = q;
		case 2: w[1] = q;
		case 1: w[0] = q;
	}
}

#else

inline void new_memsetd(dword *t, dword data, int len){
#ifdef COMPILER_CLANG
	if(len>1024*1024){
		MemSet(t,data,len);
		return;
	}
#endif
	switch(len) {
		default:{
			dword *lim = t + len;
			while(t < lim) *t++ = data;
			break;
		}
		case 16: t[15] = data;
		case 15: t[14] = data;
		case 14: t[13] = data;
		case 13: t[12] = data;
		case 12: t[11] = data;
		case 11: t[10] = data;
		case 10: t[9] = data;
		case 9: t[8] = data;
		case 8: t[7] = data;
		case 7: t[6] = data;
		case 6: t[5] = data;
		case 5: t[4] = data;
		case 4: t[3] = data;
		case 3: t[2] = data;
		case 2: t[1] = data;
		case 1: t[0] = data;
	}
}

#endif

The updated benchmarking code:

	RGBA c = Red();
	
	int bsize=8*1024*1024;
	Buffer<RGBA> b(bsize,(RGBA)Blue());

	String result="\"N\",\"Fill()\",\"new_memsetd()\",\"MemSet()\"\r\n";
	for(int len=1;len<=bsize;){
		int maximum=100000000/len;
		int64 t0=usecs();
		for(int i = 0; i < maximum; i++) Fill(~b, c, len);
		int64 t1=usecs();
		for(int i = 0; i < maximum; i++) new_memsetd((dword*)~b, *(dword*)&(c), len);
		int64 t2=usecs();
		for(int i = 0; i < maximum; i++) MemSet(~b, c, len);
		int64 t3=usecs();
		result.Cat(Format("%d,%f,%f,%f\r\n",len,1000.0*(t1-t0)/maximum,1000.0*(t2-t1)/maximum,1000.0*(t3-t2)/maximum));
		if(len<32) len++;
		else len*=2;
	}
	
	SaveFile(GetHomeDirFile("Desktop/memset.csv"),result);

Again, I suggest you plot your results using a log-log chart to clearly see the performance with all different block sizes.

If you have some time to spare, please let me know how this works for you.

Best regards,

Tom

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53946 is a reply to message #53935]

Sun, 17 May 2020 00:10

Tom1
Messages: 1319
Registered: March 2007

Ultimate Contributor

Hi,

Interestingly, the FillRGBA() that can be found in the current BufferPainter Fillers, is a real performer. It wins below 1M dwords just about anything else. However, Mireks new MemSet is the winner thereafter. This applies on Windows 10 x64 to CLANG/CLANGx64/MSBT19 on my Core i7. Only MSBT19x64 has a different situation and the following code tries to optimize that, in addition to combining FillRGBA and MemSet for the other compilers:

#if defined(WIN64) && defined(COMPILER_MSC)

// for MSBT19x64 only:
inline void new_memsetd(void *b, dword data, int len){
	dword *t=(dword *)b;
	switch(len){
		case 6: t[5] = data;
		case 5: t[4] = data;
		case 4: t[3] = data;
		case 3: t[2] = data;
		case 2: t[1] = data;
		case 1: t[0] = data;
		case 0: return;
		
		default:{
			if(len&1) *t++=data;
			len>>=1;
		
			uint64 *w=(uint64*)t;
			uint64 q=*(dword*)&data;
			q |= (q << 32);
			
			switch(len) {
				default:{
					uint64 *lim = w + len - 32;
					while(w < lim) *w++ = q;
				}
				case 32: w[31] = q;
				case 31: w[30] = q;
				case 30: w[29] = q;
				case 29: w[28] = q;
				case 28: w[27] = q;
				case 27: w[26] = q;
				case 26: w[25] = q;
				case 25: w[24] = q;
				case 24: w[23] = q;
				case 23: w[22] = q;
				case 22: w[21] = q;
				case 21: w[20] = q;
				case 20: w[19] = q;
				case 19: w[18] = q;
				case 18: w[17] = q;
				case 17: w[16] = q;
				case 16: w[15] = q;
				case 15: w[14] = q;
				case 14: w[13] = q;
				case 13: w[12] = q;
				case 12: w[11] = q;
				case 11: w[10] = q;
				case 10: w[9] = q;
				case 9: w[8] = q;
				case 8: w[7] = q;
				case 7: w[6] = q;
				case 6: w[5] = q;
				case 5: w[4] = q;
				case 4: w[3] = q;
				case 3:	w[2] = q;
				case 2: w[1] = q;
				case 1:	w[0] = q;
			}
		}
	}
}

#else

inline void new_memsetd(void *b, dword data, int len){
	if(len<=1024*1024) FillRGBA((RGBA*)b,*(RGBA*)&data,len);
	else MemSet(b,data,len);
}

#endif

The benchmarking code for various fill sizes now looks like this:

	RGBA c = Red();
	
	int bsize=8*1024*1024;
	Buffer<RGBA> b(bsize,(RGBA)Blue());

	String result="\"N\",\"Fill()\",\"new_memsetd()\",\"MemSet()\",\"FillRGBA()\"\r\n";
	for(int len=1;len<=bsize;){
		int maximum=100000000/len;
		int64 t0=usecs();
		for(int i = 0; i < maximum; i++) Fill(~b, c, len);
		int64 t1=usecs();
		for(int i = 0; i < maximum; i++) new_memsetd(~b, *(dword*)&c, len);
		int64 t2=usecs();
		for(int i = 0; i < maximum; i++) MemSet(~b, c, len);
		int64 t3=usecs();
		for(int i = 0; i < maximum; i++) FillRGBA(~b, c, len);
		int64 t4=usecs();
		result.Cat(Format("%d,%f,%f,%f,%f\r\n",len,1000.0*(t1-t0)/maximum,1000.0*(t2-t1)/maximum,1000.0*(t3-t2)/maximum,1000.0*(t4-t3)/maximum));
		if(len<64) len++;
		else len*=2;
	}
	
	SaveFile(GetHomeDirFile("Desktop/memset.csv"),result);

Best regards,

Tom

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53947 is a reply to message #53936]

Sun, 17 May 2020 08:47

mirek
Messages: 14291
Registered: November 2005

Ultimate Member

Tom1 wrote on Sat, 16 May 2020 01:59

With CLANG it seems to be beneficial to use the Mirek's new MemSet() for buffer sizes above about 1M

I guess L2 cache size plays a role here. The new trick bypasses the cache so kicks in when cache is exhausted...

Mirek

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53948 is a reply to message #53947]

Sun, 17 May 2020 10:01

Tom1
Messages: 1319
Registered: March 2007

Ultimate Contributor

mirek wrote on Sun, 17 May 2020 09:47

Tom1 wrote on Sat, 16 May 2020 01:59
With CLANG it seems to be beneficial to use the Mirek's new MemSet() for buffer sizes above about 1M

I guess L2 cache size plays a role here. The new trick bypasses the cache so kicks in when cache is exhausted...

Mirek

Hi,

Where can I find the new trick?

BR,

Tom

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53951 is a reply to message #53948]

Sun, 17 May 2020 15:49

mirek
Messages: 14291
Registered: November 2005

Ultimate Member

Ah, by "trick" I mean using using non-temporal move instruction which we have found in memset....

Mirek

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53952 is a reply to message #53951]

Sun, 17 May 2020 18:05

mirek
Messages: 14291
Registered: November 2005

Ultimate Member

What about this:

#include <CtrlLib/CtrlLib.h>
#include <smmintrin.h>

using namespace Upp;

void Fill0(RGBA *t, RGBA c, int len)
{
	while(len >= 16) {
		t[0] = c; t[1] = c; t[2] = c; t[3] = c;
		t[4] = c; t[5] = c; t[6] = c; t[7] = c;
		t[8] = c; t[9] = c; t[10] = c; t[11] = c;
		t[12] = c; t[13] = c; t[14] = c; t[15] = c;
		t += 16;
		len -= 16;
	}
	switch(len & 15) {
	case 15: t[14] = c;
	case 14: t[13] = c;
	case 13: t[12] = c;
	case 12: t[11] = c;
	case 11: t[10] = c;
	case 10: t[9] = c;
	case 9: t[8] = c;
	case 8: t[7] = c;
	case 7: t[6] = c;
	case 6: t[5] = c;
	case 5: t[4] = c;
	case 4: t[3] = c;
	case 3: t[2] = c;
	case 2: t[1] = c;
	case 1: t[0] = c;
	}
}

void Fill2(RGBA *t, RGBA c, int len)
{
	while(len >= 16) {
		t[0] = c; t[1] = c; t[2] = c; t[3] = c;
		t[4] = c; t[5] = c; t[6] = c; t[7] = c;
		t[8] = c; t[9] = c; t[10] = c; t[11] = c;
		t[12] = c; t[13] = c; t[14] = c; t[15] = c;
		t += 16;
		len -= 16;
	}
	if(len & 8) {
		t[0] = t[1] = t[2] = t[3] = t[4] = t[5] = t[6] = t[7] = c;
		t += 8;
	}
	if(len & 4) {
		t[0] = t[1] = t[2] = t[3] = c;
		t += 4;
	}
	if(len & 2) {
		t[0] = t[1] = c;
		t += 2;
	}
	if(len & 1)
		t[0] = c;
}

void Fill3(RGBA *t, RGBA c, int len)
{
	dword m[4];
	m[0] = m[1] = m[2] = m[3] = *(dword*)&(c);
	__m128d val = _mm_loadu_pd((double *)m);
	if(len >= 16) {
		if(len > 1024*1024 / 16 && ((uintptr_t)t & 3) == 0) { // for really huge data, bypass the cache
			while((uintptr_t)t & 15) { // align to 16 bytes for SSE
				*t++ = c;
				len--;
			}
			do {
				_mm_stream_pd((double *)t, val);
				_mm_stream_pd((double *)(t + 4), val);
				_mm_stream_pd((double *)(t + 8), val);
				_mm_stream_pd((double *)(t + 12), val);
				t += 16;
				len -= 16;
			}
			while(len >= 16);
			_mm_sfence();
		}
		else
			do {
				_mm_storeu_pd((double *)t, val);
				_mm_storeu_pd((double *)(t + 4), val);
				_mm_storeu_pd((double *)(t + 8), val);
				_mm_storeu_pd((double *)(t + 12), val);
				t += 16;
				len -= 16;
			}
			while(len >= 16);
	}
	if(len & 8) {
		_mm_storeu_pd((double *)t, val);
		_mm_storeu_pd((double *)(t + 4), val);
		t += 8;
	}
	if(len & 4) {
		_mm_storeu_pd((double *)t, val);
		t += 4;
	}
	if(len & 2) {
		t[0] = t[1] = c;
		t += 2;
	}
	if(len & 1)
		t[0] = c;
}

int len = 2000 * 4000;

GUI_APP_MAIN
{
	Color c = Red();
	
	Buffer<RGBA> b(2000);
	
	Vector<int> rnd;
	for(int i = 0; i < 200; i++)
		rnd << Random(100);

	for(int i = 0; i < 1000000; i++) {
		{
			RTIMING("memsetd");
			for(int i = 0; i < rnd.GetCount(); i += 2)
				memsetd(b + rnd[i], *(dword*)&(c), rnd[i + 1]);
		}
		{
			RTIMING("Fill");
			for(int i = 0; i < rnd.GetCount(); i += 2)
				Fill(b + rnd[i], c, rnd[i + 1]);
		}
		{
			RTIMING("Fill0");
			for(int i = 0; i < rnd.GetCount(); i += 2)
				Fill0(b + rnd[i], c, rnd[i + 1]);
		}
		{
			RTIMING("Fill2");
			for(int i = 0; i < rnd.GetCount(); i += 2)
				Fill2(b + rnd[i], c, rnd[i + 1]);
		}
		{
			RTIMING("Fill3");
			for(int i = 0; i < rnd.GetCount(); i += 2)
				Fill3(b + rnd[i], c, rnd[i + 1]);
		}
		{
			RTIMING("memset");
			for(int i = 0; i < rnd.GetCount(); i += 2)
				memset(b + 4 * rnd[i], 255, 4 * rnd[i + 1]);
		}
	}

	b.Alloc(len);

	for(int i = 0; i < 20; i++) {
		memsetd(b, *(dword*)&(c), len);
		{
			RTIMING("HUGE memsetd");
			memsetd(b, *(dword*)&(c), len);
		}
		{
			RTIMING("HUGE Fill");
			Fill(b, c, len);
		}
		{
			RTIMING("HUGE Fill3");
			Fill3(b, c, len);
		}
		{
			RTIMING("HUGE memset");
			memset(b, c, len * 4);
		}
	}

	BeepExclamation();
}

I believe Fill3 does not have any weakness here... Actually, CLANG produced almost exactly the same code for Fill2 and memsetd for small fills, but I guess providing SSE2 implementation directly does not hurt anything. Plus we still like to have that cache bypass...

So I would go for Fill3 for X86 and Fill2 for non-X86 (in the hope it gets optimized for neon on ARM...)

Mirek

[Updated on: Sun, 17 May 2020 18:09]

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53953 is a reply to message #53952]

Sun, 17 May 2020 20:56

Tom1
Messages: 1319
Registered: March 2007

Ultimate Contributor

Hi Mirek,

Here are my results:

CLANG

TIMING HUGE memset    : 27.28 ms -  1.36 ms (27.29 ms / 20 ), min:  1.25 ms, max:  1.66 ms, nesting: 0 - 20
TIMING HUGE Fill3     : 35.01 ms -  1.75 ms (35.01 ms / 20 ), min:  1.63 ms, max:  1.99 ms, nesting: 0 - 20
TIMING HUGE Fill      : 73.74 ms -  3.69 ms (73.75 ms / 20 ), min:  3.32 ms, max:  7.43 ms, nesting: 0 - 20
TIMING HUGE memsetd   : 72.88 ms -  3.64 ms (72.88 ms / 20 ), min:  3.40 ms, max:  4.51 ms, nesting: 0 - 20
TIMING memset         :  1.01 s  -  1.01 us ( 1.07 s  / 1000000 ), min:  1.00 us, max: 28.00 us, nesting: 0 - 1000000
TIMING Fill3          : 505.44 ms - 505.44 ns (565.98 ms / 1000000 ), min:  0.00 ns, max: 29.00 us, nesting: 0 - 1000000
TIMING Fill2          : 497.06 ms - 497.06 ns (557.61 ms / 1000000 ), min:  0.00 ns, max: 28.00 us, nesting: 0 - 1000000
TIMING Fill0          : 772.53 ms - 772.53 ns (833.07 ms / 1000000 ), min:  0.00 ns, max: 63.00 us, nesting: 0 - 1000000
TIMING Fill           :  1.67 s  -  1.67 us ( 1.73 s  / 1000000 ), min:  1.00 us, max: 58.00 us, nesting: 0 - 1000000
TIMING memsetd        : 495.67 ms - 495.67 ns (556.22 ms / 1000000 ), min:  0.00 ns, max: 28.00 us, nesting: 0 - 1000000

CLANGx64

TIMING HUGE memset    : 27.76 ms -  1.39 ms (27.76 ms / 20 ), min:  1.28 ms, max:  1.80 ms, nesting: 0 - 20
TIMING HUGE Fill3     : 36.31 ms -  1.82 ms (36.31 ms / 20 ), min:  1.59 ms, max:  2.27 ms, nesting: 0 - 20
TIMING HUGE Fill      : 73.42 ms -  3.67 ms (73.42 ms / 20 ), min:  3.41 ms, max:  4.74 ms, nesting: 0 - 20
TIMING HUGE memsetd   : 74.52 ms -  3.73 ms (74.52 ms / 20 ), min:  3.47 ms, max:  4.22 ms, nesting: 0 - 20
TIMING memset         : 898.49 ms - 898.49 ns (925.83 ms / 1000000 ), min:  0.00 ns, max: 52.00 us, nesting: 0 - 1000000
TIMING Fill3          : 492.59 ms - 492.59 ns (519.92 ms / 1000000 ), min:  0.00 ns, max: 32.00 us, nesting: 0 - 1000000
TIMING Fill2          : 495.82 ms - 495.82 ns (523.15 ms / 1000000 ), min:  0.00 ns, max: 28.00 us, nesting: 0 - 1000000
TIMING Fill0          : 569.61 ms - 569.61 ns (596.95 ms / 1000000 ), min:  0.00 ns, max: 41.00 us, nesting: 0 - 1000000
TIMING Fill           : 591.56 ms - 591.56 ns (618.90 ms / 1000000 ), min:  0.00 ns, max: 30.00 us, nesting: 0 - 1000000
TIMING memsetd        : 549.04 ms - 549.04 ns (576.37 ms / 1000000 ), min:  0.00 ns, max: 65.00 us, nesting: 0 - 1000000

MSBT19

TIMING HUGE memset    : 26.51 ms -  1.33 ms (26.51 ms / 20 ), min:  1.26 ms, max:  1.49 ms, nesting: 0 - 20
TIMING HUGE Fill3     : 35.42 ms -  1.77 ms (35.42 ms / 20 ), min:  1.58 ms, max:  2.14 ms, nesting: 0 - 20
TIMING HUGE Fill      : 25.47 ms -  1.27 ms (25.48 ms / 20 ), min:  1.18 ms, max:  1.59 ms, nesting: 0 - 20
TIMING HUGE memsetd   : 25.12 ms -  1.26 ms (25.12 ms / 20 ), min:  1.15 ms, max:  1.59 ms, nesting: 0 - 20
TIMING memset         : 978.21 ms - 978.21 ns ( 1.05 s  / 1000000 ), min:  1.00 us, max: 29.00 us, nesting: 0 - 1000000
TIMING Fill3          :  1.50 s  -  1.50 us ( 1.58 s  / 1000000 ), min:  1.00 us, max: 29.00 us, nesting: 0 - 1000000
TIMING Fill2          :  1.89 s  -  1.89 us ( 1.96 s  / 1000000 ), min:  1.00 us, max: 34.00 us, nesting: 0 - 1000000
TIMING Fill0          :  2.02 s  -  2.02 us ( 2.09 s  / 1000000 ), min:  1.00 us, max: 33.00 us, nesting: 0 - 1000000
TIMING Fill           :  2.06 s  -  2.06 us ( 2.14 s  / 1000000 ), min:  1.00 us, max: 32.00 us, nesting: 0 - 1000000
TIMING memsetd        :  1.62 s  -  1.62 us ( 1.69 s  / 1000000 ), min:  1.00 us, max: 45.00 us, nesting: 0 - 1000000

MSBT19x64

TIMING HUGE memset    : 26.96 ms -  1.35 ms (26.96 ms / 20 ), min:  1.27 ms, max:  1.90 ms, nesting: 0 - 20
TIMING HUGE Fill3     : 35.07 ms -  1.75 ms (35.08 ms / 20 ), min:  1.62 ms, max:  2.02 ms, nesting: 0 - 20
TIMING HUGE Fill      : 67.09 ms -  3.35 ms (67.09 ms / 20 ), min:  3.17 ms, max:  3.60 ms, nesting: 0 - 20
TIMING HUGE memsetd   : 25.64 ms -  1.28 ms (25.64 ms / 20 ), min:  1.19 ms, max:  1.48 ms, nesting: 0 - 20
TIMING memset         : 818.75 ms - 818.75 ns (856.11 ms / 1000000 ), min:  0.00 ns, max: 31.00 us, nesting: 0 - 1000000
TIMING Fill3          :  1.36 s  -  1.36 us ( 1.40 s  / 1000000 ), min:  1.00 us, max: 31.00 us, nesting: 0 - 1000000
TIMING Fill2          :  1.67 s  -  1.67 us ( 1.70 s  / 1000000 ), min:  1.00 us, max: 30.00 us, nesting: 0 - 1000000
TIMING Fill0          :  1.66 s  -  1.66 us ( 1.70 s  / 1000000 ), min:  1.00 us, max: 46.00 us, nesting: 0 - 1000000
TIMING Fill           :  1.68 s  -  1.68 us ( 1.72 s  / 1000000 ), min:  1.00 us, max: 50.00 us, nesting: 0 - 1000000
TIMING memsetd        :  1.50 s  -  1.50 us ( 1.54 s  / 1000000 ), min:  1.00 us, max: 29.00 us, nesting: 0 - 1000000

Fill3 is generally the best, but I experience two issues behind the scenes of this benchmark:

1. On MSBT19 / MSBT19x64 there is a significant penalty for small counts. It results in 5 ns per call, whereas in CLANG it is about 0.8 - 1.0 ns per call.
2. On MSBT19x64 the optimal threshold size is 2M counts on my Core i7. However, interestingly the default threshold value works better with MSBT19 on the same computer.

I will continue to investigate this.

Thanks and best regards,

Tom

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53954 is a reply to message #53953]

Sun, 17 May 2020 21:46

Tom1
Messages: 1319
Registered: March 2007

Ultimate Contributor

Mirek,

Also, please check my previous new_memsetd() above using MSBT19x64 for reference. Preferably also with short transfers (1-64).

Best regards,

Tom

[Updated on: Sun, 17 May 2020 21:46]

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53955 is a reply to message #53751]

Sun, 17 May 2020 22:00

Oblivion
Messages: 1266
Registered: August 2007

Senior Contributor

Here's results on (AMD FX, linux x64):

GCC:

TIMING HUGE memset    : 113,98 ms -  5,70 ms (114,00 ms / 20 ), min:  5,00 ms, max:  6,00 ms, nesting: 0 - 20
TIMING HUGE Fill3     : 81,98 ms -  4,10 ms (82,00 ms / 20 ), min:  4,00 ms, max:  5,00 ms, nesting: 0 - 20
TIMING HUGE Fill      : 145,98 ms -  7,30 ms (146,00 ms / 20 ), min:  7,00 ms, max:  8,00 ms, nesting: 0 - 20
TIMING HUGE memsetd   : 125,98 ms -  6,30 ms (126,00 ms / 20 ), min:  6,00 ms, max:  7,00 ms, nesting: 0 - 20
TIMING memset         :  1,24 s  -  1,24 us ( 2,23 s  / 1000000 ), min:  0,00 ns, max:  1,00 ms, nesting: 0 - 1000000
TIMING Fill3          :  2,01 s  -  2,01 us ( 3,01 s  / 1000000 ), min:  0,00 ns, max:  1,00 ms, nesting: 0 - 1000000
TIMING Fill2          :  2,43 s  -  2,43 us ( 3,42 s  / 1000000 ), min:  0,00 ns, max:  1,00 ms, nesting: 0 - 1000000
TIMING Fill0          :  2,77 s  -  2,77 us ( 3,76 s  / 1000000 ), min:  0,00 ns, max:  1,00 ms, nesting: 0 - 1000000
TIMING Fill           :  2,72 s  -  2,72 us ( 3,71 s  / 1000000 ), min:  0,00 ns, max:  1,00 ms, nesting: 0 - 1000000
TIMING memsetd        :  1,80 s  -  1,80 us ( 2,80 s  / 1000000 ), min:  0,00 ns, max:  1,00 ms, nesting: 0 - 1000000

ClANG:

TIMING HUGE memset    : 120,98 ms -  6,05 ms (121,00 ms / 20 ), min:  5,00 ms, max:  7,00 ms, nesting: 0 - 20
TIMING HUGE Fill3     : 81,98 ms -  4,10 ms (82,00 ms / 20 ), min:  4,00 ms, max:  5,00 ms, nesting: 0 - 20
TIMING HUGE Fill      : 130,98 ms -  6,55 ms (131,00 ms / 20 ), min:  6,00 ms, max:  7,00 ms, nesting: 0 - 20
TIMING HUGE memsetd   : 133,98 ms -  6,70 ms (134,00 ms / 20 ), min:  6,00 ms, max:  7,00 ms, nesting: 0 - 20
TIMING memset         :  1,49 s  -  1,49 us ( 2,39 s  / 1000000 ), min:  0,00 ns, max:  1,00 ms, nesting: 0 - 1000000
TIMING Fill3          :  1,74 s  -  1,74 us ( 2,64 s  / 1000000 ), min:  0,00 ns, max:  1,00 ms, nesting: 0 - 1000000
TIMING Fill2          :  1,62 s  -  1,62 us ( 2,52 s  / 1000000 ), min:  0,00 ns, max:  1,00 ms, nesting: 0 - 1000000
TIMING Fill0          :  2,00 s  -  2,00 us ( 2,90 s  / 1000000 ), min:  0,00 ns, max:  1,00 ms, nesting: 0 - 1000000
TIMING Fill           :  2,06 s  -  2,06 us ( 2,96 s  / 1000000 ), min:  0,00 ns, max:  1,00 ms, nesting: 0 - 1000000
TIMING memsetd        :  2,18 s  -  2,18 us ( 3,08 s  / 1000000 ), min:  0,00 ns, max:  1,00 ms, nesting: 0 - 1000000

Best regards,
Oblivion

Github page: https://github.com/ismail-yilmaz
Bobcat the terminal emulator: https://github.com/ismail-yilmaz/Bobcat

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53957 is a reply to message #53953]

Sun, 17 May 2020 23:25

Tom1
Messages: 1319
Registered: March 2007

Ultimate Contributor

Mirek,

Please check out this one. It features better performance on MSBT19 / MSBT19x64 with low counts, and works well on CLANG/CLANGx64 too:

inline void new_memset128(void *b, dword data, int len){
	
	switch(len){
		case 4: ((dword *)b)[3] = data;
		case 3: ((dword *)b)[2] = data;
		case 2: ((dword *)b)[1] = data;
		case 1: ((dword *)b)[0] = data;
		case 0: return;
	}
	
	__m128i q = _mm_set1_epi32(*(int*)&data);
	__m128i *w = (__m128i*)b;
	
	switch(len>>2){
		default:{
			__m128i *e = (__m128i*)b + (len>>2) - 4;
			if(len <= 2*1024*1024){
				while(w<e){
					_mm_storeu_si128(w++, q);
					_mm_storeu_si128(w++, q);
					_mm_storeu_si128(w++, q);
					_mm_storeu_si128(w++, q);
				}
			}
			else{
				while(w<e){
					_mm_stream_si128(w++, q);
					_mm_stream_si128(w++, q);
					_mm_stream_si128(w++, q);
					_mm_stream_si128(w++, q);
				}
			}
		}
		case 4: _mm_storeu_si128(w++, q);
		case 3: _mm_storeu_si128(w++, q);
		case 2: _mm_storeu_si128(w++, q);
		case 1: _mm_storeu_si128(w++, q);
	}
	switch(len&3){
		case 3: ((dword *)b)[len-3] = data;
		case 2: ((dword *)b)[len-2] = data;
		case 1: ((dword *)b)[len-1] = data;
	}
}

Best regards,

Tom

EDIT: Fine tuning...

[Updated on: Sun, 17 May 2020 23:55]

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53960 is a reply to message #53957]

Mon, 18 May 2020 10:16

mirek
Messages: 14291
Registered: November 2005

Ultimate Member

Tom1 wrote on Sun, 17 May 2020 23:25

Mirek,

Please check out this one. It features better performance on MSBT19 / MSBT19x64 with low counts, and works well on CLANG/CLANGx64 too:

I think there are 2 issues:

- Cache bypass starts at 8MB.
- Missing alignment adjustment for cache bypass.
- I might be wrong, but why is there " - 4": __m128i *e = (__m128i*)t + (len>>2) - 4; ?

But yes, it hits something for MSC compiler... Smile

Smile

Mirek

[Updated on: Mon, 18 May 2020 10:25]

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53961 is a reply to message #53960]

Mon, 18 May 2020 11:13

Tom1
Messages: 1319
Registered: March 2007

Ultimate Contributor

Hi,

Yes, you're right: The alignment should be handled. I'll take a look at it... (just need to minimize the code size in order to avoid penalty for short transfers. It is extremely sensitive.)

The cache limit is intentionally 8MB as this is the sweet spot for my Core i7. Probably should get this value from the system to optimize the correct threshold.

The -4 compensates the rest of the samples handled within the leaked default in the switch. (The below cases do the trick).

Best regards,

Tom

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53962 is a reply to message #53961]

Mon, 18 May 2020 13:31

Tom1
Messages: 1319
Registered: March 2007

Ultimate Contributor

Hi,

Alignment corrected. (Caused obviously a lot of rearranging things to obtain balance.) Threshold is still at 8M, but feel free to experiment.

inline void new_memset128(void *b, dword data, int len){
	switch(len){
		case 5: ((dword *)b)[4] = data;
		case 4: ((dword *)b)[3] = data;
		case 3: ((dword *)b)[2] = data;
		case 2: ((dword *)b)[1] = data;
		case 1: ((dword *)b)[0] = data;
		case 0: return;
	}
	
	__m128i q = _mm_set1_epi32(*(int*)&data);
	__m128i *w = (__m128i*)b;
	__m128i *e = (__m128i*)b + (len>>2);

	if(len <= 2*1024*1024 || ((uintptr_t)b&3)){
		while(w<e-1){
			_mm_storeu_si128(w++, q);
			_mm_storeu_si128(w++, q);
		}
		if(w<e) _mm_storeu_si128(w++, q);
	}
	else{
		int s=(-((int)((uintptr_t)b)>>2))&0x3;
		switch(s){
			case 3: ((dword *)b)[2] = data;
			case 2: ((dword *)b)[1] = data;
			case 1: ((dword *)b)[0] = data;
		}
		
		w = (__m128i*) ((dword*)b + s);
		
		while(w<e) _mm_stream_si128(w++, q);
		_mm_sfence();
	}

	switch(len&3){
		case 3: ((dword *)b)[len-3] = data;
		case 2: ((dword *)b)[len-2] = data;
		case 1: ((dword *)b)[len-1] = data;
	}
}

Best regards,

Tom

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53963 is a reply to message #53961]

Mon, 18 May 2020 13:33

mirek
Messages: 14291
Registered: November 2005

Ultimate Member

So I kept digging and found that the reason why Fill3 performed great with CLANG and less great with MSC was that CLANG understand what I mean with that ugly array based code to fill 4 color values into the xmm register while MSC really created that 'm' array, stored 4 values into memory and then fetched them into xmm... Smile

Smile

Fixed Fill3 seems to perform well with MSC too:

void Fill3(RGBA *t, RGBA c, int len)
{
	__m128i val4 = _mm_set1_epi32(*(int*)&c);
	auto Set4 = [&](int at) { _mm_storeu_si128((__m128i *)(t + at), val4); };
	auto Set4S = [&](int at) { _mm_stream_si128((__m128i *)(t + at), val4); };
	if(len >= 16) {
		if(len > 1024*1024 / 16 && ((uintptr_t)t & 3) == 0) { // for really huge data, bypass the cache
			while((uintptr_t)t & 15) { // align to 16 bytes for SSE
				*t++ = c;
				len--;
			}
			do {
				Set4S(0);
				Set4S(4);
				Set4S(8);
				Set4S(12);
				t += 16;
				len -= 16;
			}
			while(len >= 16);
			_mm_sfence();
		}
		else
			do {
				Set4(0);
				Set4(4);
				Set4(8);
				Set4(12);
				t += 16;
				len -= 16;
			}
			while(len >= 16);
	}
	if(len & 8) {
		Set4(0);
		Set4(4);
		t += 8;
	}
	if(len & 4) {
		Set4(0);
		t += 4;
	}
	if(len & 2) {
		t[0] = t[1] = c;
		t += 2;
	}
	if(len & 1)
		t[0] = c;
}

Mirek

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53964 is a reply to message #53751]

Mon, 18 May 2020 13:43

mirek
Messages: 14291
Registered: November 2005

Ultimate Member

This variation of basically the same thing seems a tiny bit faster:

void Fill3a(RGBA *t, RGBA c, int len)
{
	__m128i val4 = _mm_set1_epi32(*(int*)&c);
	auto Set4 = [&](int at) { _mm_storeu_si128((__m128i *)(t + at), val4); };
	auto Set4S = [&](int at) { _mm_stream_si128((__m128i *)(t + at), val4); };
	if(len >= 32) {
		if(len > 1024*1024 / 16 && ((uintptr_t)t & 3) == 0) { // for really huge data, bypass the cache
			while((uintptr_t)t & 15) { // align to 16 bytes for SSE
				*t++ = c;
				len--;
			}
			do {
				Set4S(0); Set4S(4); Set4S(8); Set4S(12);
				Set4S(16); Set4S(20); Set4S(24); Set4S(28);
				t += 32;
				len -= 32;
			}
			while(len >= 32);
			_mm_sfence();
		}
		else
			do {
				Set4(0); Set4(4); Set4(8); Set4(12);
				Set4(16); Set4(20); Set4(24); Set4(28);
				t += 32;
				len -= 32;
			}
			while(len >= 32);
	}
	if(len & 16) {
		Set4(0); Set4(4); Set4(8); Set4(12);
		t += 16;
	}
	if(len & 8) {
		Set4(0); Set4(4);
		t += 8;
	}
	if(len & 4) {
		Set4(0);
		t += 4;
	}
	if(len & 2) {
		t[0] = t[1] = c;
		t += 2;
	}
	if(len & 1)
		t[0] = c;
}

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53965 is a reply to message #53962]

Mon, 18 May 2020 13:53

mirek
Messages: 14291
Registered: November 2005

Ultimate Member

You can actually do alignment without branching there (that I got from studying memset code Smile

Smile

. I guess that is the last thing to try now Smile

Smile

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53966 is a reply to message #53965]

Mon, 18 May 2020 16:06

Tom1
Messages: 1319
Registered: March 2007

Ultimate Contributor

mirek wrote on Mon, 18 May 2020 14:53

You can actually do alignment without branching there (that I got from studying memset code . I guess that is the last thing to try now

Hi,

Sounds good, but seems hard to squeeze speed from ... (tried quite a while now).

BR, Tom

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53967 is a reply to message #53966]

Mon, 18 May 2020 17:08

Tom1
Messages: 1319
Registered: March 2007

Ultimate Contributor

Mirek,

Here it is: The unconditional alignment. I took your idea, ditched my own, and modified your Fill3a as follows:

void inline Fill3T(void *b, dword data, int len){
	switch(len){
		case 3: ((dword *)b)[2] = data;
		case 2: ((dword *)b)[1] = data;
		case 1: ((dword *)b)[0] = data;
		case 0: return;
	}
	__m128i q = _mm_set1_epi32(*(int*)&data);
	__m128i *w = (__m128i*)b;
	
	if(len >= 32) {
		__m128i *e = (__m128i*)b + (len>>2) - 8;
		if(len > 1024*1024 / 16 && ((uintptr_t)w & 3) == 0) { // for really huge data, bypass the cache
			_mm_storeu_si128(w, q); // Head align
			int s=(-((int)((uintptr_t)b)>>2))&0x3;
			w = (__m128i*) ((dword*)b) + s;
			do {
				_mm_stream_si128(w++, q);
				_mm_stream_si128(w++, q);
				_mm_stream_si128(w++, q);
				_mm_stream_si128(w++, q);
				_mm_stream_si128(w++, q);
				_mm_stream_si128(w++, q);
				_mm_stream_si128(w++, q);
				_mm_stream_si128(w++, q);
			}while(w<=e);
			_mm_sfence();
		}
		else
			do {
				_mm_storeu_si128(w++, q);
				_mm_storeu_si128(w++, q);
				_mm_storeu_si128(w++, q);
				_mm_storeu_si128(w++, q);
				_mm_storeu_si128(w++, q);
				_mm_storeu_si128(w++, q);
				_mm_storeu_si128(w++, q);
				_mm_storeu_si128(w++, q);
			}while(w<=e);
	}
	
	if(len & 16) {
		_mm_storeu_si128(w++, q);
		_mm_storeu_si128(w++, q);
		_mm_storeu_si128(w++, q);
		_mm_storeu_si128(w++, q);
	}
	if(len & 8) {
		_mm_storeu_si128(w++, q);
		_mm_storeu_si128(w++, q);
	}
	if(len & 4) {
		_mm_storeu_si128(w, q);
	}
	_mm_storeu_si128((__m128i*) (((dword*)b) + len - 4), q); // Tail align
}

I made some other changes too and this one is slightly faster on short transfers while equals Fill3a() on longer ones. The improvement is more significant on MSBT19 / MSBT19x64.

In order to get real fast short transfers, the function must be 'inline'. I think this necessitates two variants of the final function. (I have seen that BufferPainter paints most of the time with really short fills, so inlining really makes a difference there.)

Best regards,

Tom

P.S. My cache threshold is still at 8M...

[Updated on: Mon, 18 May 2020 17:11]

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53968 is a reply to message #53967]

Mon, 18 May 2020 18:12

mirek
Messages: 14291
Registered: November 2005

Ultimate Member

Well, my full idea was to align for len >= 32 always and MAYBE have some benefit from the fact that stores are now aligned (even perhaps use aligned version). Sources diverge on actuall performance, but it might be around 10%. In any case, MSC memset does this...

Mirek

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53970 is a reply to message #53967]

Mon, 18 May 2020 18:28

mirek
Messages: 14291
Registered: November 2005

Ultimate Member

Tom1 wrote on Mon, 18 May 2020 17:08

In order to get real fast short transfers, the function must be 'inline'. I think this necessitates two variants of the final function. (I have seen that BufferPainter paints most of the time with really short fills, so inlining really makes a difference there.)

Well, CLANG inlines all Fill3 variants without me asking him to do it, so I guess I have zero problems to have it in the header...

Quote:

P.S. My cache threshold is still at 8M...

What are your CPU L1/L2/L3 caches?

What happens if you move that to 1M, 12M, 16M? (I mean, how much penalty you get?)

Mirek

Report message to a moderator

Pages (3): [1 2 3 › »]

Previous Topic:	Should we still care about big-endian CPUs?
Next Topic:	TheIDE crash after switching package

Goto Forum:

-=] Back to Top [=-

[ Syndicate this forum (XML) ] [

] [

PDF

]

Current Time: Wed May 13 06:13:35 GMT+2 2026

Total time taken to generate the page: 0.01317 seconds