SSE2 and SVO optimization (Painter, memcpy....) [message #53751]

Mon, 27 April 2020 19:19

Tom1
Messages: 1305
Registered: March 2007

Ultimate Contributor

Hi,

Here's an optimization for BufferPainter.

BufferPainter::Clear(RGBA) speed is improved by over 30 % with the following change in Painter/Render.cpp:

void BufferPainter::ClearOp(const RGBA& color)
{
//	UPP::Fill(~*ip, color, ip->GetLength());
	FillRGBA(~*ip, color, ip->GetLength());
	ip->SetKind(color.a == 255 ? IMAGE_OPAQUE : IMAGE_ALPHA);
}

And in Painter/Fillers.h:

namespace Upp {

// Add the following line:
#define FillRGBA(a,b,c) memsetd((a),*(dword*)&(b),(c)) 

struct SolidFiller : Rasterizer::Filler {

This may be significant in some usage scenarios as it can currently take e.g. 4.5 milliseconds to clear a 4K ImageBuffer before drawing to it. This can now be reduced to 2.8 milliseconds.
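The `FillRGBA` macro above reinterprets the `RGBA` value as a `dword` through a pointer cast. As a minimal standalone sketch of the same idea outside U++ (the `Pixel` struct and `FillPixels` are hypothetical stand-ins for `RGBA` and `memsetd`, not U++ API, and the channel order is an assumption), the 32-bit pattern can be extracted with `memcpy`, which avoids the aliasing cast:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for U++'s RGBA: four 8-bit channels, 4 bytes in total.
struct Pixel {
	std::uint8_t b, g, r, a;
};
static_assert(sizeof(Pixel) == 4, "Pixel must pack into one 32-bit word");

// Fill `len` pixels with `color` by writing one 32-bit word per pixel.
void FillPixels(Pixel *dst, Pixel color, std::size_t len)
{
	assert(dst != nullptr || len == 0);
	std::uint32_t word;
	std::memcpy(&word, &color, sizeof word); // well-defined alternative to *(dword*)&color
	for(std::size_t i = 0; i < len; i++)
		std::memcpy(dst + i, &word, sizeof word); // typically compiled to a plain 32-bit store
}
```

Whether a compiler vectorizes this loop as well as a dedicated `memsetd` is exactly the kind of thing that has to be measured rather than assumed.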

Best regards,

Tom

EDIT: Changed code to use the newly optimized FillRGBA() found in Fillers.h. This can be found at:
https://www.ultimatepp.org/forums/index.php?t=msg&th=11011&goto=53752&#msg_53752

[Updated on: Sun, 24 May 2020 10:22] by Moderator

Re: BufferPainter::Clear() optimization [message #53757 is a reply to message #53751]

Tue, 28 April 2020 10:12

mirek
Messages: 14267
Registered: November 2005

Ultimate Member

Now this is really interesting. Fill for RGBA* is actually one that is optimized for filling huge blocks. I will need to do some benchmarks...

Re: BufferPainter::Clear() optimization [message #53758 is a reply to message #53757]

Tue, 28 April 2020 10:20

mirek
Messages: 14267
Registered: November 2005

Ultimate Member

Current Fill(RGBA *) assembler code:

4000EEE0  cmp r8d,byte +0x10 
4000EEE4  jl 0x14000ef13 
4000EEE6  movd xmm0,edx 
4000EEEA  pshufd xmm0,xmm0,0x0 
4000EEEF  nop 
4000EEF0  mov eax,r8d 
4000EEF3  movdqu [rcx],xmm0 
4000EEF7  movdqu [rcx+0x10],xmm0 
4000EEFC  movdqu [rcx+0x20],xmm0 
4000EF01  movdqu [rcx+0x30],xmm0 
4000EF06  add rcx,byte +0x40 
4000EF0A  lea r8d,[rax-0x10] 
4000EF0E  cmp eax,byte +0x1f 
4000EF11  jg 0x14000eef0 
4000EF13  add r8d,byte -0x1 
4000EF17  cmp r8d,byte +0xe 
4000EF1B  ja 0x14000ef59 
4000EF1D  lea r9,[rel 0x4000ef5c] 
4000EF24  movsxd rax,dword [r9+r8*4] 
4000EF28  add rax,r9 
4000EF2B  jmp rax 
4000EF2D  mov [rcx+0x38],edx 
4000EF30  mov [rcx+0x34],edx 
4000EF33  mov [rcx+0x30],edx 
4000EF36  mov [rcx+0x2c],edx 
4000EF39  mov [rcx+0x28],edx 
4000EF3C  mov [rcx+0x24],edx 
4000EF3F  mov [rcx+0x20],edx 
4000EF42  mov [rcx+0x1c],edx 
4000EF45  mov [rcx+0x18],edx 
4000EF48  mov [rcx+0x14],edx 
4000EF4B  mov [rcx+0x10],edx 
4000EF4E  mov [rcx+0xc],edx 
4000EF51  mov [rcx+0x8],edx 
4000EF54  mov [rcx+0x4],edx 
4000EF57  mov [rcx],edx 
4000EF59  ret

and the central snippet from the memsetd variant....

40001565  movaps xmm0,[rel 0x402c60a0] 
4000156C  nop dword [rax+0x0] 
40001570  movups [rsi+rdx*4],xmm0 
40001574  movups [rsi+rdx*4+0x10],xmm0 
40001579  movups [rsi+rdx*4+0x20],xmm0 
4000157E  movups [rsi+rdx*4+0x30],xmm0 
40001583  movups [rsi+rdx*4+0x40],xmm0 
40001588  movups [rsi+rdx*4+0x50],xmm0 
4000158D  movups [rsi+rdx*4+0x60],xmm0 
40001592  movups [rsi+rdx*4+0x70],xmm0 
40001597  movups [rsi+rdx*4+0x80],xmm0 
4000159F  movups [rsi+rdx*4+0x90],xmm0 
400015A7  movups [rsi+rdx*4+0xa0],xmm0 
400015AF  movups [rsi+rdx*4+0xb0],xmm0 
400015B7  movups [rsi+rdx*4+0xc0],xmm0 
400015BF  movups [rsi+rdx*4+0xd0],xmm0 
400015C7  movups [rsi+rdx*4+0xe0],xmm0 
400015CF  movups [rsi+rdx*4+0xf0],xmm0 
400015D7  add rdx,byte +0x40 
400015DB  add rdi,byte +0x8 
400015DF  jnz 0x140001570

Interesting...
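For readers less used to disassembly: the main loop of the first listing broadcasts the 32-bit color into an XMM register (`movd` + `pshufd`) and then issues four unaligned 16-byte stores per iteration, i.e. 16 pixels per pass. A rough intrinsics equivalent, as a sketch only (the name `Fill32Sse2` is made up here, and a simple scalar tail replaces the jump table the compiler generated):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2 1
#endif

// Fill `len` 32-bit words with `value`; no alignment requirement on `t`.
void Fill32Sse2(std::uint32_t *t, std::uint32_t value, std::size_t len)
{
	assert(t != nullptr || len == 0);
#ifdef HAVE_SSE2
	const __m128i v = _mm_set1_epi32((int)value); // movd + pshufd in the listing
	while(len >= 16) {                            // 4 x 16 bytes = 16 words per iteration
		_mm_storeu_si128((__m128i *)(t), v);
		_mm_storeu_si128((__m128i *)(t + 4), v);
		_mm_storeu_si128((__m128i *)(t + 8), v);
		_mm_storeu_si128((__m128i *)(t + 12), v);
		t += 16;
		len -= 16;
	}
#endif
	while(len--) // scalar tail (the compiled code uses a jump table instead)
		*t++ = value;
}
```

The memsetd listing differs mainly in how far the loop is unrolled (sixteen stores per iteration instead of four), which is consistent with the two variants timing almost identically in the results below.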

Benchmarking code

#include <CtrlLib/CtrlLib.h>

using namespace Upp;

GUI_APP_MAIN
{
	Color c = Red();
	
	int len = 4000 * 2000;
	
	Buffer<RGBA> b(len);

	for(int i = 0; i < 1000; i++) {
		{
			RTIMING("memsetd");
			memsetd(b, *(dword*)&(c), len);
		}
		{
			RTIMING("Fill");
			Fill(b, c, len);
		}
	}
}

CLANGx64, 2700x

TIMING Fill : 2.73 s - 2.73 ms ( 2.73 s / 1000 ), min: 2.00 ms, max: 4.00 ms, nesting: 0 - 1000
TIMING memsetd : 2.78 s - 2.78 ms ( 2.78 s / 1000 ), min: 2.00 ms, max: 5.00 ms, nesting: 0 - 1000

MSBT19x64

TIMING Fill : 2.89 s - 2.89 ms ( 2.89 s / 1000 ), min: 2.00 ms, max: 5.00 ms, nesting: 0 - 1000
TIMING memsetd : 2.90 s - 2.90 ms ( 2.90 s / 1000 ), min: 2.00 ms, max: 5.00 ms, nesting: 0 - 1000

[Updated on: Tue, 28 April 2020 10:31]

Re: BufferPainter::Clear() optimization [message #53760 is a reply to message #53757]

Tue, 28 April 2020 10:27

Tom1
Messages: 1305
Registered: March 2007

Ultimate Contributor

Hi,

Benchmarking and tuning is exactly what I did through yesterday (and beyond). I worked with both CLANGx64 and MSBT19x64. I worked out a bunch of optimized fillers until it turned out that memsetd() wins easily on large blocks and mostly on smaller blocks too. Especially on MSBT19x64 there does not seem to be a way to beat memsetd(). On CLANGx64 small transfer of one or two items was slightly faster, but on larger blocks memsetd() won again. Interestingly, CLANGx64 was a lot faster than MSBT19x64 for any of my own block transfer attempts, but still could not beat memsetd().

Best regards,

Tom

Re: BufferPainter::Clear() optimization [message #53761 is a reply to message #53760]

Tue, 28 April 2020 10:33

mirek
Messages: 14267
Registered: November 2005

Ultimate Member

I guess it might be CPU related... ?

Re: BufferPainter::Clear() optimization [message #53763 is a reply to message #53761]

Tue, 28 April 2020 11:10

mirek
Messages: 14267
Registered: November 2005

Ultimate Member

Hm, macOS, 2.3 GHz Intel Core i5:

TIMING Fill : 1.52 s - 1.52 ms ( 1.52 s / 1000 ), min: 1.00 ms, max: 2.00 ms, nesting: 0 - 1000
TIMING memsetd : 1.53 s - 1.53 ms ( 1.53 s / 1000 ), min: 1.00 ms, max: 12.00 ms, nesting: 0 - 1000

That's quite weird...

Mirek

Re: BufferPainter::Clear() optimization [message #53764 is a reply to message #53761]

Tue, 28 April 2020 11:17

Tom1
Messages: 1305
Registered: March 2007

Ultimate Contributor

Hi,

Yes, the CPU is likely the major player here. I took the liberty of modifying TimingInspector to get finer timing granularity using usecs(). The modified test case can be found below. I get the following results on my Core i7 + Windows 10 Professional x64. Now we can focus on the best round ('min:') to better exclude the effect of other tasks. As you can see, memsetd on MSBT19x64 is quite an amazing performer.

MSBT19x64, Intel Core i7:

TIMING memsetd : 1.45 s - 1.45 ms ( 1.45 s / 1000 ), min: 1.15 ms, max: 5.25 ms, nesting: 0 - 1000
TIMING Fill : 3.73 s - 3.73 ms ( 3.73 s / 1000 ), min: 3.25 ms, max: 9.92 ms, nesting: 0 - 1000

CLANGx64, Intel Core i7:

TIMING memsetd : 3.85 s - 3.85 ms ( 3.85 s / 1000 ), min: 3.35 ms, max: 10.36 ms, nesting: 0 - 1000
TIMING Fill : 3.87 s - 3.87 ms ( 3.87 s / 1000 ), min: 3.38 ms, max: 11.33 ms, nesting: 0 - 1000

I guess that in my larger program the optimizations did not work this well, as Fill would have performed at around the 5 ms level for a buffer of this size.

Anyway here's the modified benchmark.

#include <CtrlLib/CtrlLib.h>

using namespace Upp;

class UTimingInspector {
protected:
	static bool active;

	const char *name;
	int         call_count;
	int64       total_time;
	int64       min_time;
	int64       max_time;
	int         max_nesting;
	int         all_count;
	StaticMutex mutex;

public:
	UTimingInspector(const char *name = NULL); // Not String !!!
	~UTimingInspector();

	void   Add(dword time, int nesting);

	String Dump();

	class Routine {
	public:
		Routine(UTimingInspector& stat, int& nesting)
		: nesting(nesting), stat(stat) {
			start_time = usecs();
			nesting++;
		}

		~Routine() {
			nesting--;
			stat.Add(start_time, nesting);
		}

	protected:
		int64 start_time;
		int& nesting;
		UTimingInspector& stat;
	};

	static void Activate(bool b)                    { active = b; }
};

bool UTimingInspector::active = true;

static UTimingInspector s_zero; // time of Start / End without actual body to measure

UTimingInspector::UTimingInspector(const char *_name) {
	name = _name ? _name : "";
	all_count = call_count = max_nesting = min_time = max_time = total_time = 0;
	static bool init;
	if(!init) {
#if defined(PLATFORM_WIN32) && !defined(PLATFORM_WINCE)
		timeBeginPeriod(1);
#endif
		init = true;
	}
}

UTimingInspector::~UTimingInspector() {
	if(this == &s_zero) return;
	Mutex::Lock __(mutex);
	StdLog() << Dump() << "\r\n";
}

void UTimingInspector::Add(dword time, int nesting)
{
	time = usecs() - time;
	Mutex::Lock __(mutex);
	if(!active) return;
	all_count++;
	if(nesting > max_nesting)
		max_nesting = nesting;
	if(nesting == 0) {
		total_time += time;
		if(call_count++ == 0)
			min_time = max_time = time;
		else {
			if(time < min_time)
				min_time = time;
			if(time > max_time)
				max_time = time;
		}
	}
}

String UTimingInspector::Dump() {
	Mutex::Lock __(mutex);
	String s = Sprintf("TIMING %-15s: ", name);
	if(call_count == 0)
		return s + "No active hit";
	ONCELOCK {
		int w = GetTickCount();
		while(GetTickCount() - w < 200) { // measure profiling overhead
			thread_local int nesting = 0;
			UTimingInspector::Routine __(s_zero, nesting);
		}
	}
	double tm = max(0.0, double(total_time) / call_count / 1000000 -
			             double(s_zero.total_time) / s_zero.call_count / 1000000);
	return s
	       + timeFormat(tm * call_count)
	       + " - " + timeFormat(tm)
	       + " (" + timeFormat((double)total_time  / 1000000) + " / "
	       + Sprintf("%d )", call_count)
		   + ", min: " + timeFormat((double)min_time / 1000000)
		   + ", max: " + timeFormat((double)max_time / 1000000)
		   + Sprintf(", nesting: %d - %d", max_nesting, all_count);
}

#define RUTIMING(x) \
	static UTimingInspector COMBINE(sTmStat, __LINE__)(x); \
	static thread_local int COMBINE(sTmStatNesting, __LINE__); \
	UTimingInspector::Routine COMBINE(sTmStatR, __LINE__)(COMBINE(sTmStat, __LINE__), COMBINE(sTmStatNesting, __LINE__))

GUI_APP_MAIN
{
	Color c = Red();
	
	int len = 4000 * 2000;
	
	Buffer<RGBA> b(len);

	for(int i = 0; i < 1000; i++) {
		{
			RUTIMING("Fill");
			Fill(b, c, len);
		}
		{
			RUTIMING("memsetd");
			memsetd(b, *(dword*)&(c), len);
		}
	}
}
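The idea of reading the 'min:' column can be reproduced outside U++ as well. A minimal sketch using only the standard library (`MinTimeUs` is a hypothetical helper, not part of TimingInspector): run the workload several times and keep the fastest run, which is the one least disturbed by other tasks.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <limits>

// Runs `work` `rounds` times and returns the shortest run in microseconds.
template <class F>
std::int64_t MinTimeUs(F&& work, int rounds)
{
	assert(rounds > 0);
	using Clock = std::chrono::steady_clock; // monotonic, unaffected by wall-clock changes
	std::int64_t best = std::numeric_limits<std::int64_t>::max();
	for(int i = 0; i < rounds; i++) {
		auto t0 = Clock::now();
		work();
		auto t1 = Clock::now();
		std::int64_t us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
		best = std::min(best, us);
	}
	return best;
}
```

Something like `MinTimeUs([&] { Fill(b, c, len); }, 1000)` would then stand in for the `RUTIMING` block, at the cost of losing the average and max figures.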

Best regards,

Tom

Re: BufferPainter::Clear() optimization [message #53765 is a reply to message #53763]

Tue, 28 April 2020 11:27

Oblivion
Messages: 1225
Registered: August 2007

Senior Contributor

Hello,

A quick test on an older AMD FX 6100, a six-core processor at 3.2 GHz (naturally, it is slower):

// GCC (x64, latest ver.)
TIMING Fill           :  7,53 s  -  7,53 ms ( 7,53 s  / 1000 ), min:  7,00 ms, max:  9,00 ms, nesting: 0 - 1000
TIMING memsetd        :  6,31 s  -  6,31 ms ( 6,31 s  / 1000 ), min:  6,00 ms, max: 18,00 ms, nesting: 0 - 1000
                                                                                     ----
// CLANG(x64, latest ver.)
 TIMING Fill           :  7,07 s  -  7,07 ms ( 7,07 s  / 1000 ), min:  6,00 ms, max: 10,00 ms, nesting: 0 - 1000
 TIMING memsetd        :  7,07 s  -  7,07 ms ( 7,08 s  / 1000 ), min:  6,00 ms, max: 17,00 ms, nesting: 0 - 1000
                                                                                     -----

Best regards,
Oblivion

Github page: https://github.com/ismail-yilmaz
Bobcat the terminal emulator: https://github.com/ismail-yilmaz/Bobcat

Re: BufferPainter::Clear() optimization [message #53913 is a reply to message #53751]

Fri, 15 May 2020 09:04

mirek
Messages: 14267
Registered: November 2005

Ultimate Member

Experimenting with parallel:

#include <CtrlLib/CtrlLib.h>

using namespace Upp;

void CoFill(RGBA *t, RGBA c, int len)
{
	const int CHUNK = 1024;
	std::atomic<int> ii(0);
	CoDo([&] {
		for(;;) {
			int pos = CHUNK * ii++;
			if(pos >= len)
				break;
			Fill(t + pos, c, min(CHUNK, len - pos));
		}
	});
}

GUI_APP_MAIN
{
	Color c = Red();
	
	int len = 4000 * 2000;
	
	Buffer<RGBA> b(len);

	for(int i = 0; i < 10; i++) {
		{
			RTIMING("memsetd");
			memsetd(b, *(dword*)&(c), len);
		}
		{
			RTIMING("Fill");
			Fill(b, c, len);
		}
		{
			RTIMING("CoFill");
			CoFill(b, c, len);
		}
	}
}

TIMING CoFill         : 19.00 ms -  1.90 ms (19.00 ms / 10 ), min:  1.00 ms, max:  3.00 ms, nesting: 0 - 10
TIMING Fill           : 31.00 ms -  3.10 ms (31.00 ms / 10 ), min:  3.00 ms, max:  4.00 ms, nesting: 0 - 10
TIMING memsetd        : 30.00 ms -  3.00 ms (30.00 ms / 10 ), min:  2.00 ms, max:  5.00 ms, nesting: 0 - 10

To try that on a different CPU, here are the Raspberry Pi 4 numbers:

TIMING CoFill         : 145.00 ms - 14.50 ms (145.00 ms / 10 ), min: 14.00 ms, max: 15.00 ms, nesting: 0 - 10
TIMING Fill           : 225.00 ms - 22.50 ms (225.00 ms / 10 ), min: 22.00 ms, max: 24.00 ms, nesting: 0 - 10
TIMING memsetd        : 184.00 ms - 18.40 ms (184.00 ms / 10 ), min: 11.00 ms, max: 77.00 ms, nesting: 0 - 10
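`CoDo` runs the lambda on U++'s worker threads, with the shared atomic counter handing out chunks. The same work-distribution pattern can be sketched with plain `std::thread` for readers without U++ at hand. `ParallelFill`, the chunk size and the thread count are illustrative assumptions, and unlike `CoDo` this spawns fresh threads on every call, so it should not be expected to reproduce the timings above:

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Fill `len` words using `nthreads` threads that grab fixed-size chunks from a shared counter.
void ParallelFill(std::uint32_t *t, std::uint32_t value, std::size_t len, unsigned nthreads = 4)
{
	assert(nthreads > 0);
	const std::size_t CHUNK = 1024;
	std::atomic<std::size_t> next(0); // index of the next chunk to hand out
	auto worker = [&] {
		for(;;) {
			std::size_t pos = CHUNK * next++;
			if(pos >= len)
				break;
			std::fill_n(t + pos, std::min(CHUNK, len - pos), value);
		}
	};
	std::vector<std::thread> pool;
	for(unsigned i = 1; i < nthreads; i++)
		pool.emplace_back(worker);
	worker(); // the calling thread takes part as well
	for(auto& th : pool)
		th.join();
}
```

Because each thread simply takes the next free chunk, a slow or preempted thread does not hold up the others, which is the main attraction of this scheme over splitting the buffer into equal halves up front.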

[Updated on: Fri, 15 May 2020 10:18]

Re: BufferPainter::Clear() optimization [message #53914 is a reply to message #53913]

Fri, 15 May 2020 10:18

Tom1
Messages: 1305
Registered: March 2007

Ultimate Contributor

Hi Mirek,

While interesting, I found that a plain memset() is way faster than memsetd() or Fill(). Just filling with 0xff (as the RGBA is for white) gives superior speed. I currently use memset() to clear an ImageBuffer to white before giving it to BufferPainter. For more complex fill colors, I guess, the apex_memmove / memcpy code could be investigated for a more optimal result. (I posted a link to the apex code here on the forum shortly before the release of 2020.1.)
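The memset() shortcut works for white because all four bytes of the pixel are 0xff, and memset() can only replicate a single byte. A small sketch of that check with a generic fallback for other colors (`AllBytesEqual` and `FillWord` are hypothetical helpers, shown on raw 32-bit words rather than `RGBA`):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// True if all four bytes of `w` are identical (e.g. 0xffffffff or 0x00000000).
bool AllBytesEqual(std::uint32_t w)
{
	std::uint32_t b = w & 0xff;
	return w == b * 0x01010101u;
}

// Use memset() when the pattern is byte-uniform, otherwise fall back to a word fill.
void FillWord(std::uint32_t *t, std::uint32_t value, std::size_t len)
{
	assert(t != nullptr || len == 0);
	if(len == 0)
		return;
	if(AllBytesEqual(value))
		std::memset(t, (int)(value & 0xff), len * sizeof(std::uint32_t));
	else
		std::fill_n(t, len, value);
}
```

This covers opaque white, transparent black and a handful of grays with matching alpha; any other color still needs a real 32-bit fill.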

Best regards,

Tom

Re: BufferPainter::Clear() optimization [message #53915 is a reply to message #53914]

Fri, 15 May 2020 11:33

mirek
Messages: 14267
Registered: November 2005

Ultimate Member

With CLANG, memset performance is about the same. However, with MSVC, it really is pretty damn fast.

I have dug into the code and the key ingredient seems to be the MOVNTPS instruction, which means the code could be easily adapted to setting dwords. I just need to understand the MT implications mentioned here:

https://www.felixcloutier.com/x86/movntps

It also might be questionable how this will affect the performance down the road (data not being in cache and everything...)
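On the multithreading caveat: non-temporal stores are weakly ordered, so the usual advice is to issue a store fence before telling another thread that the data is ready. A sketch of that pattern, assuming SSE2 and using hypothetical names (`StreamFill`, `ProducerConsumerDemo`); on non-x86 builds it degrades to ordinary stores:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define HAVE_SSE2 1
#endif

// Fill with non-temporal stores; `t` must be 16-byte aligned and `len` a multiple of 4.
void StreamFill(std::uint32_t *t, std::uint32_t value, std::size_t len)
{
	assert(((std::uintptr_t)t & 15) == 0 && len % 4 == 0);
#ifdef HAVE_SSE2
	const __m128i v = _mm_set1_epi32((int)value);
	for(std::size_t i = 0; i < len; i += 4)
		_mm_stream_si128((__m128i *)(t + i), v); // bypasses the cache
	_mm_sfence(); // make the streamed data visible before any later store
#else
	for(std::size_t i = 0; i < len; i++)
		t[i] = value;
#endif
}

// Producer fills a buffer and publishes it via a flag; consumer checks the contents.
bool ProducerConsumerDemo()
{
	alignas(64) static std::uint32_t buf[4096];
	std::atomic<bool> ready(false);
	bool ok = true;
	std::thread consumer([&] {
		while(!ready.load(std::memory_order_acquire))
			std::this_thread::yield();
		for(std::uint32_t x : buf)
			ok = ok && x == 0x11223344u;
	});
	StreamFill(buf, 0x11223344u, 4096);           // fence is issued inside
	ready.store(true, std::memory_order_release); // publish only after the fence
	consumer.join();
	return ok;
}
```

The cache question raised above is separate from ordering: streamed data is not left in the cache, so code that reads the buffer right after filling it may pay for that later.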

Mirek

Re: BufferPainter::Clear() optimization [message #53916 is a reply to message #53915]

Fri, 15 May 2020 11:41

Tom1
Messages: 1305
Registered: March 2007

Ultimate Contributor

At the time I was testing with the memset -- if I remember correctly -- on Windows + CLANG the memset with zero value was very efficient too, but the rest of the set values were slower. So, there must be some special optimized implementation for zeroing memory on CLANG too.

BR, Tom

Re: BufferPainter::Clear() optimization [message #53917 is a reply to message #53915]

Fri, 15 May 2020 11:47

mirek
Messages: 14267
Registered: November 2005

Ultimate Member

Here we go:

void SSEFill2(RGBA *t, RGBA c, int len)
{
	if(len >= 512) {
		while((uintptr_t)t & 63) { // align to cache line
			*t++ = c;
			len--;
		}
		dword m[4];
		m[0] = m[1] = m[2] = m[3] = *(dword*)&(c);
		__m128d val = _mm_loadu_pd((double *)m);
		while(len >= 16) {
			_mm_stream_pd((double *)t, val);
			_mm_stream_pd((double *)(t + 4), val);
			_mm_stream_pd((double *)(t + 8), val);
			_mm_stream_pd((double *)(t + 12), val);
			t += 16;
			len -= 16;
		}
		_mm_sfence();
	}

	Fill(t, c, len);
}

TIMING CoFill         : 42.00 ms -  2.10 ms (42.00 ms / 20 ), min:  1.00 ms, max:  3.00 ms, nesting: 0 - 20
TIMING SSEFill2       : 16.00 ms - 799.98 us (16.00 ms / 20 ), min:  0.00 ns, max:  1.00 ms, nesting: 0 - 20
TIMING SSEFill        : 55.00 ms -  2.75 ms (55.00 ms / 20 ), min:  2.00 ms, max:  3.00 ms, nesting: 0 - 20
TIMING Fill           : 56.00 ms -  2.80 ms (56.00 ms / 20 ), min:  2.00 ms, max:  3.00 ms, nesting: 0 - 20
TIMING memsetd        : 52.00 ms -  2.60 ms (52.00 ms / 20 ), min:  2.00 ms, max:  3.00 ms, nesting: 0 - 20

Re: BufferPainter::Clear() optimization [message #53918 is a reply to message #53917]

Fri, 15 May 2020 12:08

Tom1
Messages: 1305
Registered: March 2007

Ultimate Contributor

And we have a winner!!

Also, please take a look at MSBT19 and MSBT19x64 for this too. It looks like this code only works with CLANG and CLANGx64 on Windows. (Have not checked on Linux yet.)
Additionally, plain memset, memsets and memsetd -variants would be useful for various tasks, as their efficiency varies depending on the compiler.

Thanks and best regards,

Tom

EDIT: I mean it does not compile on MSBT...

[Updated on: Fri, 15 May 2020 12:09]

Re: BufferPainter::Clear() optimization [message #53919 is a reply to message #53751]

Fri, 15 May 2020 12:16

Oblivion
Messages: 1225
Registered: August 2007

Senior Contributor

On Linux with the relatively old AMD Athlon FX 6100.

Works with both GCC (9.3) and CLANG (10.0). Requires #include <smmintrin.h>:

GCC:

TIMING SSEFill2       : 43,99 ms -  4,40 ms (44,00 ms / 10 ), min:  4,00 ms, max:  5,00 ms, nesting: 0 - 10
TIMING CoFill         : 55,99 ms -  5,60 ms (56,00 ms / 10 ), min:  5,00 ms, max:  6,00 ms, nesting: 0 - 10
TIMING Fill           : 75,99 ms -  7,60 ms (76,00 ms / 10 ), min:  7,00 ms, max:  8,00 ms, nesting: 0 - 10
TIMING memsetd        : 66,99 ms -  6,70 ms (67,00 ms / 10 ), min:  5,00 ms, max: 17,00 ms, nesting: 0 - 10

CLANG:

TIMING SSEFill2       : 45,99 ms -  4,60 ms (46,00 ms / 10 ), min:  4,00 ms, max:  7,00 ms, nesting: 0 - 10
TIMING CoFill         : 55,99 ms -  5,60 ms (56,00 ms / 10 ), min:  5,00 ms, max:  6,00 ms, nesting: 0 - 10
TIMING Fill           : 65,99 ms -  6,60 ms (66,00 ms / 10 ), min:  6,00 ms, max: 10,00 ms, nesting: 0 - 10
TIMING memsetd        : 78,99 ms -  7,90 ms (79,00 ms / 10 ), min:  5,00 ms, max: 23,00 ms, nesting: 0 - 10

[Updated on: Fri, 15 May 2020 12:27]

Re: BufferPainter::Clear() optimization [message #53920 is a reply to message #53919]

Fri, 15 May 2020 12:28

Tom1
Messages: 1305
Registered: March 2007

Ultimate Contributor

Hi,

Thanks Oblivion; the #include <smmintrin.h> was exactly what was needed on Windows + CLANG too...

Here are the results for the 4k RGBA fill on Windows 10 x64 on Core i9:

MSBT19:
TIMING SSEFill2       :  1.30 s  -  1.30 ms ( 1.30 s  / 1000 ), min:  1.03 ms, max:  1.99 ms, nesting: 0 - 1000
TIMING Fill           :  1.13 s  -  1.13 ms ( 1.13 s  / 1000 ), min: 841.00 us, max:  3.04 ms, nesting: 0 - 1000

MSBT19x64:
TIMING SSEFill2       : 906.90 ms - 906.90 us (906.93 ms / 1000 ), min: 846.00 us, max:  1.67 ms, nesting: 0 - 1000
TIMING Fill           :  2.34 s  -  2.34 ms ( 2.34 s  / 1000 ), min:  2.21 ms, max:  4.69 ms, nesting: 0 - 1000

CLANG:
TIMING SSEFill2       : 935.97 ms - 935.97 us (936.02 ms / 1000 ), min: 854.00 us, max:  1.67 ms, nesting: 0 - 1000
TIMING Fill           :  2.44 s  -  2.44 ms ( 2.44 s  / 1000 ), min:  2.25 ms, max:  4.74 ms, nesting: 0 - 1000

CLANGx64:
TIMING SSEFill2       : 934.45 ms - 934.45 us (934.47 ms / 1000 ), min: 854.00 us, max:  1.77 ms, nesting: 0 - 1000
TIMING Fill           :  2.20 s  -  2.20 ms ( 2.20 s  / 1000 ), min:  1.98 ms, max:  5.97 ms, nesting: 0 - 1000

Looks very good indeed! MSBT19 on the other hand looks surprising...

Best regards,

Tom

Re: BufferPainter::Clear() optimization [message #53922 is a reply to message #53918]

Fri, 15 May 2020 13:15

mirek
Messages: 14267
Registered: November 2005

Ultimate Member

Tom1 wrote on Fri, 15 May 2020 12:08

Additionally, plain memset, memsets and memsetd -variants would be useful for various tasks, as their efficiency varies depending on the compiler.

What about this:

void FillCacheLines(void *cache_aligned_ptr, void *data16, int count)
{
	dword *t = (dword *)cache_aligned_ptr;
	__m128d val = _mm_loadu_pd((double *)data16);
	dword *e = t + 16 * count;
	while(t < e) {
		_mm_stream_pd((double *)t, val);
		_mm_stream_pd((double *)(t + 4), val);
		_mm_stream_pd((double *)(t + 8), val);
		_mm_stream_pd((double *)(t + 12), val);
		t += 16;
	}
	_mm_sfence();
}

template <class T>
void MemSet(void *dest, T data, int len)
{
	static_assert(sizeof(T) == 1 || sizeof(T) == 2 || sizeof(T) == 4 || sizeof(T) == 8 || sizeof(T) == 16, "invalid sizeof");
	T *t = (T *)dest;
	if(len * sizeof(T) > 550) {
		while((uintptr_t)t & 63) { // align to cache line
			*t++ = data;
			len--;
		}
		const int itemn = 16 / sizeof(T);
		const int per_cache_line = 4 * itemn;
		T m[itemn];
		for(int i = 0; i < itemn; i++)
			m[i] = data;
		int count = len / per_cache_line;
		FillCacheLines(t, m, count);
		t += per_cache_line * count; // advance past the streamed block so the tail is written after it
		len -= per_cache_line * count;
	}
	
	while(len >= 16) {
		t[0] = data; t[1] = data; t[2] = data; t[3] = data;
		t[4] = data; t[5] = data; t[6] = data; t[7] = data;
		t[8] = data; t[9] = data; t[10] = data; t[11] = data;
		t[12] = data; t[13] = data; t[14] = data; t[15] = data;
		t += 16;
		len -= 16;
	}
	switch(len) {
	case 15: t[14] = data;
	case 14: t[13] = data;
	case 13: t[12] = data;
	case 12: t[11] = data;
	case 11: t[10] = data;
	case 10: t[9] = data;
	case 9: t[8] = data;
	case 8: t[7] = data;
	case 7: t[6] = data;
	case 6: t[5] = data;
	case 5: t[4] = data;
	case 4: t[3] = data;
	case 3: t[2] = data;
	case 2: t[1] = data;
	case 1: t[0] = data;
	}
}
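Routines that combine an alignment prologue, a streamed middle section and an unrolled tail leave several places for off-by-N mistakes, so it may be worth pairing `MemSet()` with a brute-force checker. A sketch (`CheckFill` is hypothetical, not from U++) that surrounds the target range with guard words and verifies that exactly `len` items are written for many lengths and alignments:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns true if fill(ptr, value, len) writes exactly `len` words, and nothing else,
// for every length up to `max_len` and for 16 different start offsets.
template <class F>
bool CheckFill(F fill, std::size_t max_len = 1200)
{
	assert(max_len > 0);
	const std::uint32_t GUARD = 0xA5A5A5A5u, VALUE = 0x12345678u;
	const std::size_t PAD = 32; // guard words around the tested range
	std::vector<std::uint32_t> buf(max_len + 2 * PAD);
	for(std::size_t off = 0; off < 16; off++) // vary alignment within a 64-byte line
		for(std::size_t len = 0; len <= max_len; len++) {
			std::fill(buf.begin(), buf.end(), GUARD);
			const std::size_t start = PAD / 2 + off;
			fill(buf.data() + start, VALUE, len);
			for(std::size_t i = 0; i < buf.size(); i++) {
				bool inside = i >= start && i < start + len;
				if(buf[i] != (inside ? VALUE : GUARD))
					return false; // missed item or write outside the range
			}
		}
	return true;
}
```

With U++ available, a call along the lines of `CheckFill([](std::uint32_t *t, std::uint32_t v, std::size_t n) { MemSet(t, v, (int)n); })` would exercise both the short path and the streamed path, since 1200 dwords is well above the 550-byte threshold.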

Re: BufferPainter::Clear() optimization [message #53923 is a reply to message #53922]

Fri, 15 May 2020 13:36

Tom1
Messages: 1305
Registered: March 2007

Ultimate Contributor

Mirek,

Yes, absolutely beautiful!

The results for the set including the new MemSet() on Win10x64 on Core i9 are:

MSBT19:

TIMING MemSet         : 831.06 ms - 831.06 us (831.13 ms / 1000 ), min: 779.00 us, max:  1.72 ms, nesting: 0 - 1000
TIMING SSEFill2       :  1.21 s  -  1.21 ms ( 1.21 s  / 1000 ), min:  1.00 ms, max:  2.19 ms, nesting: 0 - 1000
TIMING Fill           : 915.70 ms - 915.70 us (915.76 ms / 1000 ), min: 859.00 us, max:  3.49 ms, nesting: 0 - 1000

MSBT19x64:

TIMING MemSet         : 818.33 ms - 818.33 us (818.36 ms / 1000 ), min: 777.00 us, max:  1.71 ms, nesting: 0 - 1000
TIMING SSEFill2       : 899.74 ms - 899.74 us (899.77 ms / 1000 ), min: 854.00 us, max:  1.78 ms, nesting: 0 - 1000
TIMING Fill           :  2.29 s  -  2.29 ms ( 2.29 s  / 1000 ), min:  2.21 ms, max:  4.51 ms, nesting: 0 - 1000

CLANG:

TIMING MemSet         : 835.39 ms - 835.39 us (835.45 ms / 1000 ), min: 790.00 us, max:  1.51 ms, nesting: 0 - 1000
TIMING SSEFill2       : 918.63 ms - 918.63 us (918.68 ms / 1000 ), min: 872.00 us, max:  1.47 ms, nesting: 0 - 1000
TIMING Fill           :  2.36 s  -  2.36 ms ( 2.36 s  / 1000 ), min:  2.28 ms, max:  5.45 ms, nesting: 0 - 1000

CLANGx64:

TIMING MemSet         : 838.86 ms - 838.86 us (838.89 ms / 1000 ), min: 787.00 us, max:  1.70 ms, nesting: 0 - 1000
TIMING SSEFill2       : 921.49 ms - 921.49 us (921.51 ms / 1000 ), min: 870.00 us, max:  1.84 ms, nesting: 0 - 1000
TIMING Fill           :  2.10 s  -  2.10 ms ( 2.10 s  / 1000 ), min:  2.01 ms, max:  5.00 ms, nesting: 0 - 1000

I trust you can now make all the different fillers through U++ to use this new code... right?

Thanks and best regards,

Tom

Re: BufferPainter::Clear() optimization [message #53934 is a reply to message #53923]

Fri, 15 May 2020 23:13

Tom1
Messages: 1305
Registered: March 2007

Ultimate Contributor

Hi Mirek,

The game is not over yet, I'm afraid. I did some additional benchmarking with varying buffer lengths to set. It gets more complicated...

	RGBA c = Red();
	
	int bsize=8*1024*1024;
	Buffer<RGBA> b(bsize,(RGBA)Blue());

	String result="\"N\",\"Fill()\",\"memsetd()\",\"MemSet()\"\r\n";
	for(int len=1;len<=bsize;len*=2){
		int maximum=1000000000/len;
		int64 t0=usecs();
		for(int i = 0; i < maximum; i++) Fill(~b, c, len);
		int64 t1=usecs(t0);
		t0=usecs();
		for(int i = 0; i < maximum; i++) memsetd(~b, *(dword*)&(c), len);
		int64 t2=usecs(t0);
		t0=usecs();
		for(int i = 0; i < maximum; i++) MemSet(~b, c, len);
		int64 t3=usecs(t0);
		result.Cat(Format("%d,%f,%f,%f\r\n",len,1000.0*t1/maximum,1000.0*t2/maximum,1000.0*t3/maximum));
	}
	
	SaveFile(GetHomeDirFile("Desktop/memset.csv"),result);

Now, if you import the resulting memset.csv to your spreadsheet program and create a log-log plot, you will see that the different buffer lengths have a huge impact on the performance of each algorithm. As filling lengths can be quite diverse, I think we need to think about some combination of the different algorithms. Additionally, we need to look at the results on different CPUs. I will keep tinkering on this one for a while here.
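One way to turn such a table into a combined strategy is to find the lengths at which the fastest algorithm changes. A small sketch of that post-processing step, assuming the three timing columns are stored in the same order as in the CSV (`Row`, `Winner` and `Crossovers` are hypothetical names, and no measured data is included here):

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

struct Row {
	int    len;   // number of items filled
	double ns[3]; // time per call: Fill(), memsetd(), MemSet()
};

// Index (0..2) of the fastest algorithm for one row.
int Winner(const Row& r)
{
	int best = 0;
	for(int i = 1; i < 3; i++)
		if(r.ns[i] < r.ns[best])
			best = i;
	return best;
}

// Returns (len, winner) pairs for each length at which the fastest algorithm changes.
std::vector<std::pair<int, int>> Crossovers(const std::vector<Row>& rows)
{
	std::vector<std::pair<int, int>> out;
	int prev = -1;
	for(const Row& r : rows) {
		assert(r.len > 0);
		int w = Winner(r);
		if(w != prev) {
			out.emplace_back(r.len, w);
			prev = w;
		}
	}
	return out;
}
```

The resulting crossover lengths would be candidates for dispatch thresholds, though they would need to be compared across CPUs and compilers before being hard-coded.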

(Now I'm running on Core i7 here at home, so this one I can test easily, and also the Core i9 at the office every now and then, as the situation is what it is...)

Best regards,

Tom

Re: BufferPainter::Clear() optimization [message #53935 is a reply to message #53923]

Fri, 15 May 2020 23:45

Didier
Messages: 736
Registered: November 2008
Location: France

Contributor

Here is what I get on Linux with a Ryzen 2700.
Due to unstable results with 10 loops, I also included results for 1000 loops.

The new MemSet() is definitely a really good addition.

==== CLANG X64 ====
TIMING MemSet : 10.00 ms - 999.98 us (10.00 ms / 10 ), min: 1.00 ms, max: 1.00 ms, nesting: 0 - 10
TIMING SSEFill2 : 12.00 ms - 1.20 ms (12.00 ms / 10 ), min: 1.00 ms, max: 2.00 ms, nesting: 0 - 10
TIMING CoFill : 21.00 ms - 2.10 ms (21.00 ms / 10 ), min: 2.00 ms, max: 3.00 ms, nesting: 0 - 10
TIMING Fill : 30.00 ms - 3.00 ms (30.00 ms / 10 ), min: 3.00 ms, max: 3.00 ms, nesting: 0 - 10
TIMING memsetd : 30.00 ms - 3.00 ms (30.00 ms / 10 ), min: 2.00 ms, max: 9.00 ms, nesting: 0 - 10

TIMING MemSet : 833.97 ms - 833.97 us (834.00 ms / 1000 ), min: 0.00 ns, max: 2.00 ms, nesting: 0 - 1000
TIMING SSEFill2 : 870.97 ms - 870.97 us (871.00 ms / 1000 ), min: 0.00 ns, max: 2.00 ms, nesting: 0 - 1000
TIMING CoFill : 1.88 s - 1.88 ms ( 1.88 s / 1000 ), min: 1.00 ms, max: 3.00 ms, nesting: 0 - 1000
TIMING Fill : 2.90 s - 2.90 ms ( 2.90 s / 1000 ), min: 2.00 ms, max: 4.00 ms, nesting: 0 - 1000
TIMING memsetd : 2.51 s - 2.51 ms ( 2.51 s / 1000 ), min: 2.00 ms, max: 10.00 ms, nesting: 0 - 1000

==== GCC X64 ====
TIMING MemSet : 7.00 ms - 699.98 us ( 7.00 ms / 10 ), min: 0.00 ns, max: 1.00 ms, nesting: 0 - 10
TIMING SSEFill2 : 9.00 ms - 899.98 us ( 9.00 ms / 10 ), min: 0.00 ns, max: 1.00 ms, nesting: 0 - 10
TIMING CoFill : 23.00 ms - 2.30 ms (23.00 ms / 10 ), min: 2.00 ms, max: 3.00 ms, nesting: 0 - 10
TIMING Fill : 30.00 ms - 3.00 ms (30.00 ms / 10 ), min: 2.00 ms, max: 4.00 ms, nesting: 0 - 10
TIMING memsetd : 35.00 ms - 3.50 ms (35.00 ms / 10 ), min: 2.00 ms, max: 10.00 ms, nesting: 0 - 10

TIMING MemSet : 820.98 ms - 820.98 us (821.00 ms / 1000 ), min: 0.00 ns, max: 2.00 ms, nesting: 0 - 1000
TIMING SSEFill2 : 877.98 ms - 877.98 us (878.00 ms / 1000 ), min: 0.00 ns, max: 2.00 ms, nesting: 0 - 1000
TIMING CoFill : 1.85 s - 1.85 ms ( 1.85 s / 1000 ), min: 1.00 ms, max: 3.00 ms, nesting: 0 - 1000
TIMING Fill : 2.97 s - 2.97 ms ( 2.97 s / 1000 ), min: 2.00 ms, max: 4.00 ms, nesting: 0 - 1000
TIMING memsetd : 2.52 s - 2.52 ms ( 2.52 s / 1000 ), min: 2.00 ms, max: 8.00 ms, nesting: 0 - 1000
