Overview
Examples
Screenshots
Comparisons
Applications
Download
Documentation
Tutorials
Bazaar
Status & Roadmap
FAQ
Authors & License
Forums
Funding Ultimate++
Search on this site
Search in forums












SourceForge.net Logo
Home » Developing U++ » U++ Developers corner » SSE2 and SVO optimization (Painter, memcpy....)
SSE2 and SVO optimization (Painter, memcpy....) [message #53751] Mon, 27 April 2020 19:19 Go to next message
Tom1
Messages: 1212
Registered: March 2007
Senior Contributor
Hi,

Here's an optimization for BufferPainter.

BufferPainter::Clear(RGBA) speed is improved by over 30 % with the following change in Painter/Render.cpp:
void BufferPainter::ClearOp(const RGBA& color)
{
//	UPP::Fill(~*ip, color, ip->GetLength());
	FillRGBA(~*ip, color, ip->GetLength());
	ip->SetKind(color.a == 255 ? IMAGE_OPAQUE : IMAGE_ALPHA);
}


And in Painter/Fillers.h:
namespace Upp {

// Add the following line:
#define FillRGBA(a,b,c) memsetd((a),*(dword*)&(b),(c)) 

struct SolidFiller : Rasterizer::Filler {


This may be significant in some usage scenarios as it can currently take e.g. 4.5 milliseconds to clear a 4K ImageBuffer before drawing to it. This can now be reduced to 2.8 milliseconds.

Best regards,

Tom

EDIT: Changed code to use the newly optimized FillRGBA() found in Fillers.h. This can be found at:
https://www.ultimatepp.org/forums/index.php?t=msg&th=110 11&goto=53752&#msg_53752

[Updated on: Sun, 24 May 2020 10:22] by Moderator

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53757 is a reply to message #53751] Tue, 28 April 2020 10:12 Go to previous messageGo to next message
mirek is currently offline  mirek
Messages: 13975
Registered: November 2005
Ultimate Member
Tom1 wrote on Mon, 27 April 2020 19:19
Hi,

Here's an optimization for BufferPainter.

BufferPainter::Clear(RGBA) speed is improved by over 30 % with the following change in Painter/Render.cpp:
void BufferPainter::ClearOp(const RGBA& color)
{
//	UPP::Fill(~*ip, color, ip->GetLength());
	FillRGBA(~*ip, color, ip->GetLength());
	ip->SetKind(color.a == 255 ? IMAGE_OPAQUE : IMAGE_ALPHA);
}


And in Painter/Fillers.h:
namespace Upp {

// Add the following line:
#define FillRGBA(a,b,c) memsetd((a),*(dword*)&(b),(c)) 

struct SolidFiller : Rasterizer::Filler {


This may be significant in some usage scenarios as it can currently take e.g. 4.5 milliseconds to clear a 4K ImageBuffer before drawing to it. This can now be reduced to 2.8 milliseconds.


Now this is really interesting. Fill for RGBA* is actually one that is optimized for filling huge blocks. I will need to do some benchmarks...
Re: BufferPainter::Clear() optimization [message #53758 is a reply to message #53757] Tue, 28 April 2020 10:20 Go to previous messageGo to next message
mirek is currently offline  mirek
Messages: 13975
Registered: November 2005
Ultimate Member
Current Fill(RGBA * assembler code

4000EEE0  cmp r8d,byte +0x10 
4000EEE4  jl 0x14000ef13 
4000EEE6  movd xmm0,edx 
4000EEEA  pshufd xmm0,xmm0,0x0 
4000EEEF  nop 
4000EEF0  mov eax,r8d 
4000EEF3  movdqu [rcx],xmm0 
4000EEF7  movdqu [rcx+0x10],xmm0 
4000EEFC  movdqu [rcx+0x20],xmm0 
4000EF01  movdqu [rcx+0x30],xmm0 
4000EF06  add rcx,byte +0x40 
4000EF0A  lea r8d,[rax-0x10] 
4000EF0E  cmp eax,byte +0x1f 
4000EF11  jg 0x14000eef0 
4000EF13  add r8d,byte -0x1 
4000EF17  cmp r8d,byte +0xe 
4000EF1B  ja 0x14000ef59 
4000EF1D  lea r9,[rel 0x4000ef5c] 
4000EF24  movsxd rax,dword [r9+r8*4] 
4000EF28  add rax,r9 
4000EF2B  jmp rax 
4000EF2D  mov [rcx+0x38],edx 
4000EF30  mov [rcx+0x34],edx 
4000EF33  mov [rcx+0x30],edx 
4000EF36  mov [rcx+0x2c],edx 
4000EF39  mov [rcx+0x28],edx 
4000EF3C  mov [rcx+0x24],edx 
4000EF3F  mov [rcx+0x20],edx 
4000EF42  mov [rcx+0x1c],edx 
4000EF45  mov [rcx+0x18],edx 
4000EF48  mov [rcx+0x14],edx 
4000EF4B  mov [rcx+0x10],edx 
4000EF4E  mov [rcx+0xc],edx 
4000EF51  mov [rcx+0x8],edx 
4000EF54  mov [rcx+0x4],edx 
4000EF57  mov [rcx],edx 
4000EF59  ret 


and the central snippet from the memsetd variant....

40001565  movaps xmm0,[rel 0x402c60a0] 
4000156C  nop dword [rax+0x0] 
40001570  movups [rsi+rdx*4],xmm0 
40001574  movups [rsi+rdx*4+0x10],xmm0 
40001579  movups [rsi+rdx*4+0x20],xmm0 
4000157E  movups [rsi+rdx*4+0x30],xmm0 
40001583  movups [rsi+rdx*4+0x40],xmm0 
40001588  movups [rsi+rdx*4+0x50],xmm0 
4000158D  movups [rsi+rdx*4+0x60],xmm0 
40001592  movups [rsi+rdx*4+0x70],xmm0 
40001597  movups [rsi+rdx*4+0x80],xmm0 
4000159F  movups [rsi+rdx*4+0x90],xmm0 
400015A7  movups [rsi+rdx*4+0xa0],xmm0 
400015AF  movups [rsi+rdx*4+0xb0],xmm0 
400015B7  movups [rsi+rdx*4+0xc0],xmm0 
400015BF  movups [rsi+rdx*4+0xd0],xmm0 
400015C7  movups [rsi+rdx*4+0xe0],xmm0 
400015CF  movups [rsi+rdx*4+0xf0],xmm0 
400015D7  add rdx,byte +0x40 
400015DB  add rdi,byte +0x8 
400015DF  jnz 0x140001570 


Interesting...

Benchmarking code

#include <CtrlLib/CtrlLib.h>

using namespace Upp;

GUI_APP_MAIN
{
	Color c = Red();
	
	int len = 4000 * 2000;
	
	Buffer<RGBA> b(len);

	for(int i = 0; i < 1000; i++) {
		{
			RTIMING("memsetd");
			memsetd(b, *(dword*)&(c), len);
		}
		{
			RTIMING("Fill");
			Fill(b, c, len);
		}
	}
}


CLANGx64, 2700x

TIMING Fill : 2.73 s - 2.73 ms ( 2.73 s / 1000 ), min: 2.00 ms, max: 4.00 ms, nesting: 0 - 1000
TIMING memsetd : 2.78 s - 2.78 ms ( 2.78 s / 1000 ), min: 2.00 ms, max: 5.00 ms, nesting: 0 - 1000

MSBT19x64

TIMING Fill : 2.89 s - 2.89 ms ( 2.89 s / 1000 ), min: 2.00 ms, max: 5.00 ms, nesting: 0 - 1000
TIMING memsetd : 2.90 s - 2.90 ms ( 2.90 s / 1000 ), min: 2.00 ms, max: 5.00 ms, nesting: 0 - 1000

[Updated on: Tue, 28 April 2020 10:31]

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53760 is a reply to message #53757] Tue, 28 April 2020 10:27 Go to previous messageGo to next message
Tom1
Messages: 1212
Registered: March 2007
Senior Contributor
Hi,

Benchmarking and tuning is exactly what I did through yesterday (and beyond). I worked with both CLANGx64 and MSBT19x64. I worked out a bunch of optimized fillers until it turned out that memsetd() wins easily on large blocks and mostly on smaller blocks too. Especially on MSBT19x64 there does not seem to be a way to beat memsetd(). On CLANGx64 small transfer of one or two items was slightly faster, but on larger blocks memsetd() won again. Interestingly, CLANGx64 was a lot faster than MSBT19x64 for any of my own block transfer attempts, but still could not beat memsetd().

Best regards,

Tom
Re: BufferPainter::Clear() optimization [message #53761 is a reply to message #53760] Tue, 28 April 2020 10:33 Go to previous messageGo to next message
mirek is currently offline  mirek
Messages: 13975
Registered: November 2005
Ultimate Member
I guess it might be CPU related... ?
Re: BufferPainter::Clear() optimization [message #53763 is a reply to message #53761] Tue, 28 April 2020 11:10 Go to previous messageGo to next message
mirek is currently offline  mirek
Messages: 13975
Registered: November 2005
Ultimate Member
Hm, MacOS 2,3 GHz Intel Core i5

TIMING Fill : 1.52 s - 1.52 ms ( 1.52 s / 1000 ), min: 1.00 ms, max: 2.00 ms, nesting: 0 - 1000
TIMING memsetd : 1.53 s - 1.53 ms ( 1.53 s / 1000 ), min: 1.00 ms, max: 12.00 ms, nesting: 0 - 1000

That's quite weird...

Mirek
Re: BufferPainter::Clear() optimization [message #53764 is a reply to message #53761] Tue, 28 April 2020 11:17 Go to previous messageGo to next message
Tom1
Messages: 1212
Registered: March 2007
Senior Contributor
Hi,

Yes, CPU is likely the major player here. I took the liberty to modify TimingInspector to get finer granularity for timing using usecs(). The modified testcase can be found below. I get the following results on my Core i7 + Windows 10 professional x64. Now we can focus on the best round 'min:' to better avoid other tasks' effect. As you can see memsetd on MSBT19x64 is quite amazing performer.

MSBT19x64, Intel Core i7:

TIMING memsetd : 1.45 s - 1.45 ms ( 1.45 s / 1000 ), min: 1.15 ms, max: 5.25 ms, nesting: 0 - 1000
TIMING Fill : 3.73 s - 3.73 ms ( 3.73 s / 1000 ), min: 3.25 ms, max: 9.92 ms, nesting: 0 - 1000

CLANGx64, Intel Core i7:

TIMING memsetd : 3.85 s - 3.85 ms ( 3.85 s / 1000 ), min: 3.35 ms, max: 10.36 ms, nesting: 0 - 1000
TIMING Fill : 3.87 s - 3.87 ms ( 3.87 s / 1000 ), min: 3.38 ms, max: 11.33 ms, nesting: 0 - 1000

I guess that in my larger program the optimizations did not work this well as the Fill would have performed at around 5 ms level for this size of a buffer.

Anyway here's the modified benchmark.

#include <CtrlLib/CtrlLib.h>

using namespace Upp;

class UTimingInspector {
protected:
	static bool active;

	const char *name;
	int         call_count;
	int64       total_time;
	int64       min_time;
	int64       max_time;
	int         max_nesting;
	int         all_count;
	StaticMutex mutex;

public:
	UTimingInspector(const char *name = NULL); // Not String !!!
	~UTimingInspector();

	void   Add(dword time, int nesting);

	String Dump();

	class Routine {
	public:
		Routine(UTimingInspector& stat, int& nesting)
		: nesting(nesting), stat(stat) {
			start_time = usecs();
			nesting++;
		}

		~Routine() {
			nesting--;
			stat.Add(start_time, nesting);
		}

	protected:
		int64 start_time;
		int& nesting;
		UTimingInspector& stat;
	};

	static void Activate(bool b)                    { active = b; }
};

bool UTimingInspector::active = true;

static UTimingInspector s_zero; // time of Start / End without actual body to measure

UTimingInspector::UTimingInspector(const char *_name) {
	name = _name ? _name : "";
	all_count = call_count = max_nesting = min_time = max_time = total_time = 0;
	static bool init;
	if(!init) {
#if defined(PLATFORM_WIN32) && !defined(PLATFORM_WINCE)
		timeBeginPeriod(1);
#endif
		init = true;
	}
}

UTimingInspector::~UTimingInspector() {
	if(this == &s_zero) return;
	Mutex::Lock __(mutex);
	StdLog() << Dump() << "\r\n";
}

void UTimingInspector::Add(dword time, int nesting)
{
	time = usecs() - time;
	Mutex::Lock __(mutex);
	if(!active) return;
	all_count++;
	if(nesting > max_nesting)
		max_nesting = nesting;
	if(nesting == 0) {
		total_time += time;
		if(call_count++ == 0)
			min_time = max_time = time;
		else {
			if(time < min_time)
				min_time = time;
			if(time > max_time)
				max_time = time;
		}
	}
}

String UTimingInspector::Dump() {
	Mutex::Lock __(mutex);
	String s = Sprintf("TIMING %-15s: ", name);
	if(call_count == 0)
		return s + "No active hit";
	ONCELOCK {
		int w = GetTickCount();
		while(GetTickCount() - w < 200) { // measure profiling overhead
			thread_local int nesting = 0;
			UTimingInspector::Routine __(s_zero, nesting);
		}
	}
	double tm = max(0.0, double(total_time) / call_count / 1000000 -
			             double(s_zero.total_time) / s_zero.call_count / 1000000);
	return s
	       + timeFormat(tm * call_count)
	       + " - " + timeFormat(tm)
	       + " (" + timeFormat((double)total_time  / 1000000) + " / "
	       + Sprintf("%d )", call_count)
		   + ", min: " + timeFormat((double)min_time / 1000000)
		   + ", max: " + timeFormat((double)max_time / 1000000)
		   + Sprintf(", nesting: %d - %d", max_nesting, all_count);
}

#define RUTIMING(x) \
	static UTimingInspector COMBINE(sTmStat, __LINE__)(x); \
	static thread_local int COMBINE(sTmStatNesting, __LINE__); \
	UTimingInspector::Routine COMBINE(sTmStatR, __LINE__)(COMBINE(sTmStat, __LINE__), COMBINE(sTmStatNesting, __LINE__))

GUI_APP_MAIN
{
	Color c = Red();
	
	int len = 4000 * 2000;
	
	Buffer<RGBA> b(len);

	for(int i = 0; i < 1000; i++) {
		{
			RUTIMING("Fill");
			Fill(b, c, len);
		}
		{
			RUTIMING("memsetd");
			memsetd(b, *(dword*)&(c), len);
		}
	}
}


Best regards,

Tom
Re: BufferPainter::Clear() optimization [message #53765 is a reply to message #53763] Tue, 28 April 2020 11:27 Go to previous messageGo to next message
Oblivion is currently offline  Oblivion
Messages: 1092
Registered: August 2007
Senior Contributor
Hello,

A quick test on an older AMD FX 6100, six core processor. 3.2 GHZ (naturally, it is slower):

// GCC (x64, latest ver.)
TIMING Fill           :  7,53 s  -  7,53 ms ( 7,53 s  / 1000 ), min:  7,00 ms, max:  9,00 ms, nesting: 0 - 1000
TIMING memsetd        :  6,31 s  -  6,31 ms ( 6,31 s  / 1000 ), min:  6,00 ms, max: 18,00 ms, nesting: 0 - 1000
                                                                                     ----
// CLANG(x64, latest ver.)
 TIMING Fill           :  7,07 s  -  7,07 ms ( 7,07 s  / 1000 ), min:  6,00 ms, max: 10,00 ms, nesting: 0 - 1000
 TIMING memsetd        :  7,07 s  -  7,07 ms ( 7,08 s  / 1000 ), min:  6,00 ms, max: 17,00 ms, nesting: 0 - 1000
                                                                                     -----


Best regards,
Oblivion


Re: BufferPainter::Clear() optimization [message #53913 is a reply to message #53751] Fri, 15 May 2020 09:04 Go to previous messageGo to next message
mirek is currently offline  mirek
Messages: 13975
Registered: November 2005
Ultimate Member
Experimenting with parallel:

#include <CtrlLib/CtrlLib.h>

using namespace Upp;

void CoFill(RGBA *t, RGBA c, int len)
{
	const int CHUNK = 1024;
	std::atomic<int> ii(0);
	CoDo([&] {
		for(;;) {
			int pos = CHUNK * ii++;
			if(pos >= len)
				break;
			Fill(t + pos, c, min(CHUNK, len - pos));
		}
	});
}

GUI_APP_MAIN
{
	Color c = Red();
	
	int len = 4000 * 2000;
	
	Buffer<RGBA> b(len);

	for(int i = 0; i < 10; i++) {
		{
			RTIMING("memsetd");
			memsetd(b, *(dword*)&(c), len);
		}
		{
			RTIMING("Fill");
			Fill(b, c, len);
		}
		{
			RTIMING("CoFill");
			CoFill(b, c, len);
		}
	}
}


TIMING CoFill         : 19.00 ms -  1.90 ms (19.00 ms / 10 ), min:  1.00 ms, max:  3.00 ms, nesting: 0 - 10
TIMING Fill           : 31.00 ms -  3.10 ms (31.00 ms / 10 ), min:  3.00 ms, max:  4.00 ms, nesting: 0 - 10
TIMING memsetd        : 30.00 ms -  3.00 ms (30.00 ms / 10 ), min:  2.00 ms, max:  5.00 ms, nesting: 0 - 10


To try that on different CPU, Rapsberry PI 4 numbers:

TIMING CoFill         : 145.00 ms - 14.50 ms (145.00 ms / 10 ), min: 14.00 ms, max: 15.00 ms, nesting: 0 - 10
TIMING Fill           : 225.00 ms - 22.50 ms (225.00 ms / 10 ), min: 22.00 ms, max: 24.00 ms, nesting: 0 - 10
TIMING memsetd        : 184.00 ms - 18.40 ms (184.00 ms / 10 ), min: 11.00 ms, max: 77.00 ms, nesting: 0 - 10

[Updated on: Fri, 15 May 2020 10:18]

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53914 is a reply to message #53913] Fri, 15 May 2020 10:18 Go to previous messageGo to next message
Tom1
Messages: 1212
Registered: March 2007
Senior Contributor
Hi Mirek,

While interesting, I found that a plain memset() is way faster than memsetd() or Fill(). Just filling with 0xff (as the RGBA is for white) you will get a superior speed. I currently use memset() for a clear white on a ImageBuffer before giving it to BufferPainter. For more complex fill colors, I guess, the apex_memmove / memcpy code could be investigated for a more optimal result. (I posted a link to the apex code here on the forum briefly before release of 2020.1 Smile

Best regards,

Tom
Re: BufferPainter::Clear() optimization [message #53915 is a reply to message #53914] Fri, 15 May 2020 11:33 Go to previous messageGo to next message
mirek is currently offline  mirek
Messages: 13975
Registered: November 2005
Ultimate Member
Tom1 wrote on Fri, 15 May 2020 10:18

While interesting, I found that a plain memset() is way faster than memsetd() or Fill(). Just filling with 0xff (as the RGBA is for white) you will get a superior speed. I currently use memset() for a clear white on a ImageBuffer before giving it to BufferPainter. For more complex fill colors, I guess, the apex_memmove / memcpy code could be investigated for a more optimal result. (I posted a link to the apex code here on the forum briefly before release of 2020.1 Smile

Best regards,

Tom


With CLANG, memset performance is about the same. However, with MSVC, it really is pretty damn fast.

I have digged into the code and the key ingredient seems to be MOVNTPS instruction, which means the code could be easily adapted to setting dwords. I just need to understand MT implications mentioned here:

https://www.felixcloutier.com/x86/movntps

It also might be questionable how this will affect the performance down the road (data not being in cache and everything...)

Mirek
Re: BufferPainter::Clear() optimization [message #53916 is a reply to message #53915] Fri, 15 May 2020 11:41 Go to previous messageGo to next message
Tom1
Messages: 1212
Registered: March 2007
Senior Contributor
At the time I was testing with the memset -- if I remember correctly -- on Windows + CLANG the memset with zero value was very efficient too, but the rest of the set values were slower. So, there must be some special optimized implementation for zeroing memory on CLANG too.

BR, Tom
Re: BufferPainter::Clear() optimization [message #53917 is a reply to message #53915] Fri, 15 May 2020 11:47 Go to previous messageGo to next message
mirek is currently offline  mirek
Messages: 13975
Registered: November 2005
Ultimate Member
Here we go:

void SSEFill2(RGBA *t, RGBA c, int len)
{
	if(len >= 512) {
		while((uintptr_t)t & 63) { // align to cache line
			*t++ = c;
			len--;
		}
		dword m[4];
		m[0] = m[1] = m[2] = m[3] = *(dword*)&(c);
		__m128d val = _mm_loadu_pd((double *)m);
		while(len >= 16) {
			_mm_stream_pd((double *)t, val);
			_mm_stream_pd((double *)(t + 4), val);
			_mm_stream_pd((double *)(t + 8), val);
			_mm_stream_pd((double *)(t + 12), val);
			t += 16;
			len -= 16;
		}
		_mm_sfence();
	}

	Fill(t, c, len);
}


TIMING CoFill         : 42.00 ms -  2.10 ms (42.00 ms / 20 ), min:  1.00 ms, max:  3.00 ms, nesting: 0 - 20
TIMING SSEFill2       : 16.00 ms - 799.98 us (16.00 ms / 20 ), min:  0.00 ns, max:  1.00 ms, nesting: 0 - 20
TIMING SSEFill        : 55.00 ms -  2.75 ms (55.00 ms / 20 ), min:  2.00 ms, max:  3.00 ms, nesting: 0 - 20
TIMING Fill           : 56.00 ms -  2.80 ms (56.00 ms / 20 ), min:  2.00 ms, max:  3.00 ms, nesting: 0 - 20
TIMING memsetd        : 52.00 ms -  2.60 ms (52.00 ms / 20 ), min:  2.00 ms, max:  3.00 ms, nesting: 0 - 20
Re: BufferPainter::Clear() optimization [message #53918 is a reply to message #53917] Fri, 15 May 2020 12:08 Go to previous messageGo to next message
Tom1
Messages: 1212
Registered: March 2007
Senior Contributor
Laughing And we have a winner!!

Also, please take a look at MSBT19 and MSBT19x64 for this too. It looks like this code only works with CLANG and CLANGx64 on Windows. (Have not checked on Linux yet.)
Additionally, plain memset, memsets and memsetd -variants would be useful for various tasks, as their efficiency varies depending on the compiler.

Thanks and best regards,

Tom

EDIT: I mean it does not compile on MSBT...

[Updated on: Fri, 15 May 2020 12:09]

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53919 is a reply to message #53751] Fri, 15 May 2020 12:16 Go to previous messageGo to next message
Oblivion is currently offline  Oblivion
Messages: 1092
Registered: August 2007
Senior Contributor
On linux with the relatively old AMD Athlon FX 6100.

Works with both GCC (9.3) and CLANG (10.0). Requires #include <smmintrin.h>:

GCC:
TIMING SSEFill2       : 43,99 ms -  4,40 ms (44,00 ms / 10 ), min:  4,00 ms, max:  5,00 ms, nesting: 0 - 10
TIMING CoFill         : 55,99 ms -  5,60 ms (56,00 ms / 10 ), min:  5,00 ms, max:  6,00 ms, nesting: 0 - 10
TIMING Fill           : 75,99 ms -  7,60 ms (76,00 ms / 10 ), min:  7,00 ms, max:  8,00 ms, nesting: 0 - 10
TIMING memsetd        : 66,99 ms -  6,70 ms (67,00 ms / 10 ), min:  5,00 ms, max: 17,00 ms, nesting: 0 - 10



CLANG:
TIMING SSEFill2       : 45,99 ms -  4,60 ms (46,00 ms / 10 ), min:  4,00 ms, max:  7,00 ms, nesting: 0 - 10
TIMING CoFill         : 55,99 ms -  5,60 ms (56,00 ms / 10 ), min:  5,00 ms, max:  6,00 ms, nesting: 0 - 10
TIMING Fill           : 65,99 ms -  6,60 ms (66,00 ms / 10 ), min:  6,00 ms, max: 10,00 ms, nesting: 0 - 10
TIMING memsetd        : 78,99 ms -  7,90 ms (79,00 ms / 10 ), min:  5,00 ms, max: 23,00 ms, nesting: 0 - 10


[Updated on: Fri, 15 May 2020 12:27]

Report message to a moderator

Re: BufferPainter::Clear() optimization [message #53920 is a reply to message #53919] Fri, 15 May 2020 12:28 Go to previous messageGo to next message
Tom1
Messages: 1212
Registered: March 2007
Senior Contributor
Hi,

Thanks Oblivion; the #include <smmintrin.h> was exactly what was needed on Windows + CLANG too...

Here are the results for the 4k RGBA fill on Windows 10 x64 on Core i9:
MSBT19:
TIMING SSEFill2       :  1.30 s  -  1.30 ms ( 1.30 s  / 1000 ), min:  1.03 ms, max:  1.99 ms, nesting: 0 - 1000
TIMING Fill           :  1.13 s  -  1.13 ms ( 1.13 s  / 1000 ), min: 841.00 us, max:  3.04 ms, nesting: 0 - 1000

MSBT19x64:
TIMING SSEFill2       : 906.90 ms - 906.90 us (906.93 ms / 1000 ), min: 846.00 us, max:  1.67 ms, nesting: 0 - 1000
TIMING Fill           :  2.34 s  -  2.34 ms ( 2.34 s  / 1000 ), min:  2.21 ms, max:  4.69 ms, nesting: 0 - 1000

CLANG:
TIMING SSEFill2       : 935.97 ms - 935.97 us (936.02 ms / 1000 ), min: 854.00 us, max:  1.67 ms, nesting: 0 - 1000
TIMING Fill           :  2.44 s  -  2.44 ms ( 2.44 s  / 1000 ), min:  2.25 ms, max:  4.74 ms, nesting: 0 - 1000

CLANGx64:
TIMING SSEFill2       : 934.45 ms - 934.45 us (934.47 ms / 1000 ), min: 854.00 us, max:  1.77 ms, nesting: 0 - 1000
TIMING Fill           :  2.20 s  -  2.20 ms ( 2.20 s  / 1000 ), min:  1.98 ms, max:  5.97 ms, nesting: 0 - 1000


Looks very good indeed! MSBT19 on the other hand looks surprising...

Best regards,

Tom
Re: BufferPainter::Clear() optimization [message #53922 is a reply to message #53918] Fri, 15 May 2020 13:15 Go to previous messageGo to next message
mirek is currently offline  mirek
Messages: 13975
Registered: November 2005
Ultimate Member
Tom1 wrote on Fri, 15 May 2020 12:08

Additionally, plain memset, memsets and memsetd -variants would be useful for various tasks, as their efficiency varies depending on the compiler.


What about this:

void FillCacheLines(void *cache_aligned_ptr, void *data16, int count)
{
	dword *t = (dword *)cache_aligned_ptr;
	__m128d val = _mm_loadu_pd((double *)data16);
	dword *e = t + 16 * count;
	while(t < e) {
		_mm_stream_pd((double *)t, val);
		_mm_stream_pd((double *)(t + 4), val);
		_mm_stream_pd((double *)(t + 8), val);
		_mm_stream_pd((double *)(t + 12), val);
		t += 16;
	}
	_mm_sfence();
}

template <class T>
void MemSet(void *dest, T data, int len)
{
	static_assert(sizeof(T) == 1 || sizeof(T) == 2 || sizeof(T) == 4 || sizeof(T) == 8 || sizeof(T) == 16, "invalid sizeof");
	T *t = (T *)dest;
	if(len * sizeof(T) > 550) {
		while((uintptr_t)t & 63) { // align to cache line
			*t++ = data;
			len--;
		}
		const int itemn = 16 / sizeof(T);
		const int per_cache_line = 4 * itemn;
		T m[itemn];
		for(int i = 0; i < itemn; i++)
			m[i] = data;
		int count = len / per_cache_line;
		FillCacheLines(t, m, count);
		len -= per_cache_line * count;
	}
	
	while(len >= 16) {
		t[0] = data; t[1] = data; t[2] = data; t[3] = data;
		t[4] = data; t[5] = data; t[6] = data; t[7] = data;
		t[8] = data; t[9] = data; t[10] = data; t[11] = data;
		t[12] = data; t[13] = data; t[14] = data; t[15] = data;
		t += 16;
		len -= 16;
	}
	switch(len) {
	case 15: t[14] = data;
	case 14: t[13] = data;
	case 13: t[12] = data;
	case 12: t[11] = data;
	case 11: t[10] = data;
	case 10: t[9] = data;
	case 9: t[8] = data;
	case 8: t[7] = data;
	case 7: t[6] = data;
	case 6: t[5] = data;
	case 5: t[4] = data;
	case 4: t[3] = data;
	case 3: t[2] = data;
	case 2: t[1] = data;
	case 1: t[0] = data;
	}
}

Re: BufferPainter::Clear() optimization [message #53923 is a reply to message #53922] Fri, 15 May 2020 13:36 Go to previous messageGo to next message
Tom1
Messages: 1212
Registered: March 2007
Senior Contributor
Mirek,

Yes, absolutely beautiful!

The results for the set including the new MemSet() on Win10x64 on Core i9 are:

MSBT19:

TIMING MemSet         : 831.06 ms - 831.06 us (831.13 ms / 1000 ), min: 779.00 us, max:  1.72 ms, nesting: 0 - 1000
TIMING SSEFill2       :  1.21 s  -  1.21 ms ( 1.21 s  / 1000 ), min:  1.00 ms, max:  2.19 ms, nesting: 0 - 1000
TIMING Fill           : 915.70 ms - 915.70 us (915.76 ms / 1000 ), min: 859.00 us, max:  3.49 ms, nesting: 0 - 1000

MSBT19x64:

TIMING MemSet         : 818.33 ms - 818.33 us (818.36 ms / 1000 ), min: 777.00 us, max:  1.71 ms, nesting: 0 - 1000
TIMING SSEFill2       : 899.74 ms - 899.74 us (899.77 ms / 1000 ), min: 854.00 us, max:  1.78 ms, nesting: 0 - 1000
TIMING Fill           :  2.29 s  -  2.29 ms ( 2.29 s  / 1000 ), min:  2.21 ms, max:  4.51 ms, nesting: 0 - 1000

CLANG:

TIMING MemSet         : 835.39 ms - 835.39 us (835.45 ms / 1000 ), min: 790.00 us, max:  1.51 ms, nesting: 0 - 1000
TIMING SSEFill2       : 918.63 ms - 918.63 us (918.68 ms / 1000 ), min: 872.00 us, max:  1.47 ms, nesting: 0 - 1000
TIMING Fill           :  2.36 s  -  2.36 ms ( 2.36 s  / 1000 ), min:  2.28 ms, max:  5.45 ms, nesting: 0 - 1000

CLANGx64:

TIMING MemSet         : 838.86 ms - 838.86 us (838.89 ms / 1000 ), min: 787.00 us, max:  1.70 ms, nesting: 0 - 1000
TIMING SSEFill2       : 921.49 ms - 921.49 us (921.51 ms / 1000 ), min: 870.00 us, max:  1.84 ms, nesting: 0 - 1000
TIMING Fill           :  2.10 s  -  2.10 ms ( 2.10 s  / 1000 ), min:  2.01 ms, max:  5.00 ms, nesting: 0 - 1000



I trust you can now make all the different fillers through U++ to use this new code... right?

Thanks and best regards,

Tom
Re: BufferPainter::Clear() optimization [message #53934 is a reply to message #53923] Fri, 15 May 2020 23:13 Go to previous messageGo to next message
Tom1
Messages: 1212
Registered: March 2007
Senior Contributor
Hi Mirek,

The game is not over yet, I'm afraid. I did some additional benchmarking with varying buffer lengths to set. It get's more complicated...

	RGBA c = Red();
	
	int bsize=8*1024*1024;
	Buffer<RGBA> b(bsize,(RGBA)Blue());

	String result="\"N\",\"Fill()\",\"memsetd()\",\"MemSet()\"\r\n";
	for(int len=1;len<=bsize;len*=2){
		int maximum=1000000000/len;
		int64 t0=usecs();
		for(int i = 0; i < maximum; i++) Fill(~b, c, len);
		int64 t1=usecs(t0);
		t0=usecs();
		for(int i = 0; i < maximum; i++) memsetd(~b, *(dword*)&(c), len);
		int64 t2=usecs(t0);
		t0=usecs();
		for(int i = 0; i < maximum; i++) MemSet(~b, c, len);
		int64 t3=usecs(t0);
		result.Cat(Format("%d,%f,%f,%f\r\n",len,1000.0*t1/maximum,1000.0*t2/maximum,1000.0*t3/maximum));
	}
	
	SaveFile(GetHomeDirFile("Desktop/memset.csv"),result);


Now, if you import the resulting memset.csv to your spreadsheet program and create a log-log plot, you will see that the different buffer lengths have a huge impact on the performance of each algorithm. As filling lengths can be quite diverse, I think we need to think about some combination of the different algorithms. Additionally, we need to look at the results on different CPUs. I will keep tinkering on this one for a while here.

(Now I'm running on Core i7 here at home, so this one I can test easily, and also the Core i9 at the office every now and then, as the situation is what it is...)

Best regards,

Tom
Re: BufferPainter::Clear() optimization [message #53935 is a reply to message #53923] Fri, 15 May 2020 23:45 Go to previous messageGo to previous message
Didier is currently offline  Didier
Messages: 680
Registered: November 2008
Location: France
Contributor
Here is what I get on my Linux and Ryzen 2700
Du to unstable results with 10 loops, I also placed results for 1000 loops

The new MemSet() is definitly really a good addition Smile

==== CLANG X64 ====
TIMING MemSet : 10.00 ms - 999.98 us (10.00 ms / 10 ), min: 1.00 ms, max: 1.00 ms, nesting: 0 - 10
TIMING SSEFill2 : 12.00 ms - 1.20 ms (12.00 ms / 10 ), min: 1.00 ms, max: 2.00 ms, nesting: 0 - 10
TIMING CoFill : 21.00 ms - 2.10 ms (21.00 ms / 10 ), min: 2.00 ms, max: 3.00 ms, nesting: 0 - 10
TIMING Fill : 30.00 ms - 3.00 ms (30.00 ms / 10 ), min: 3.00 ms, max: 3.00 ms, nesting: 0 - 10
TIMING memsetd : 30.00 ms - 3.00 ms (30.00 ms / 10 ), min: 2.00 ms, max: 9.00 ms, nesting: 0 - 10

TIMING MemSet : 833.97 ms - 833.97 us (834.00 ms / 1000 ), min: 0.00 ns, max: 2.00 ms, nesting: 0 - 1000
TIMING SSEFill2 : 870.97 ms - 870.97 us (871.00 ms / 1000 ), min: 0.00 ns, max: 2.00 ms, nesting: 0 - 1000
TIMING CoFill : 1.88 s - 1.88 ms ( 1.88 s / 1000 ), min: 1.00 ms, max: 3.00 ms, nesting: 0 - 1000
TIMING Fill : 2.90 s - 2.90 ms ( 2.90 s / 1000 ), min: 2.00 ms, max: 4.00 ms, nesting: 0 - 1000
TIMING memsetd : 2.51 s - 2.51 ms ( 2.51 s / 1000 ), min: 2.00 ms, max: 10.00 ms, nesting: 0 - 1000


==== GCC X64 ====
TIMING MemSet : 7.00 ms - 699.98 us ( 7.00 ms / 10 ), min: 0.00 ns, max: 1.00 ms, nesting: 0 - 10
TIMING SSEFill2 : 9.00 ms - 899.98 us ( 9.00 ms / 10 ), min: 0.00 ns, max: 1.00 ms, nesting: 0 - 10
TIMING CoFill : 23.00 ms - 2.30 ms (23.00 ms / 10 ), min: 2.00 ms, max: 3.00 ms, nesting: 0 - 10
TIMING Fill : 30.00 ms - 3.00 ms (30.00 ms / 10 ), min: 2.00 ms, max: 4.00 ms, nesting: 0 - 10
TIMING memsetd : 35.00 ms - 3.50 ms (35.00 ms / 10 ), min: 2.00 ms, max: 10.00 ms, nesting: 0 - 10

TIMING MemSet : 820.98 ms - 820.98 us (821.00 ms / 1000 ), min: 0.00 ns, max: 2.00 ms, nesting: 0 - 1000
TIMING SSEFill2 : 877.98 ms - 877.98 us (878.00 ms / 1000 ), min: 0.00 ns, max: 2.00 ms, nesting: 0 - 1000
TIMING CoFill : 1.85 s - 1.85 ms ( 1.85 s / 1000 ), min: 1.00 ms, max: 3.00 ms, nesting: 0 - 1000
TIMING Fill : 2.97 s - 2.97 ms ( 2.97 s / 1000 ), min: 2.00 ms, max: 4.00 ms, nesting: 0 - 1000
TIMING memsetd : 2.52 s - 2.52 ms ( 2.52 s / 1000 ), min: 2.00 ms, max: 8.00 ms, nesting: 0 - 1000

Previous Topic: Should we still care about big-endian CPUs?
Next Topic: TheIDE crash after switching package
Goto Forum:
  


Current Time: Fri Apr 19 11:27:27 CEST 2024

Total time taken to generate the page: 0.02154 seconds