SSE2 and SVO optimization (Painter, memcpy....) [message #53751]
Mon, 27 April 2020 19:19
Tom1
Messages: 1303 Registered: March 2007
Ultimate Contributor
Hi,
Here's an optimization for BufferPainter.
BufferPainter::Clear(RGBA) speed is improved by over 30 % with the following change in Painter/Render.cpp:
void BufferPainter::ClearOp(const RGBA& color)
{
    // UPP::Fill(~*ip, color, ip->GetLength());
    FillRGBA(~*ip, color, ip->GetLength());
    ip->SetKind(color.a == 255 ? IMAGE_OPAQUE : IMAGE_ALPHA);
}
And in Painter/Fillers.h:
namespace Upp {
// Add the following line:
#define FillRGBA(a,b,c) memsetd((a),*(dword*)&(b),(c))
struct SolidFiller : Rasterizer::Filler {
This may be significant in some usage scenarios as it can currently take e.g. 4.5 milliseconds to clear a 4K ImageBuffer before drawing to it. This can now be reduced to 2.8 milliseconds.
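For readers without the U++ sources at hand, the idea behind the macro (treat each pixel as one 32-bit word and write that word repeatedly) can be sketched in standalone C++. This is only an illustration under assumptions: `Pixel` is a hypothetical stand-in for U++'s RGBA, and `FillPacked` is a made-up name, not the actual memsetd() implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for U++'s RGBA: four 8-bit channels, 4 bytes in total.
struct Pixel {
    std::uint8_t b, g, r, a;
};
static_assert(sizeof(Pixel) == 4, "Pixel must pack into one 32-bit word");

// Fill `count` pixels by copying the color as one 32-bit word per pixel.
// std::memcpy is used instead of the pointer cast in the macro, so the
// sketch does not depend on type punning through a dword pointer.
inline void FillPacked(Pixel *dst, Pixel color, std::size_t count)
{
    assert(dst != nullptr || count == 0);
    std::uint32_t word;
    std::memcpy(&word, &color, sizeof word);
    for(std::size_t i = 0; i < count; i++)
        std::memcpy(dst + i, &word, sizeof word);
}
```

Compilers typically turn such a loop into wide stores on their own. Whether the result matches a hand-tuned routine is exactly what the measurements in this thread are about.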
Best regards,
Tom
EDIT: Changed code to use the newly optimized FillRGBA() found in Fillers.h. This can be found at:
https://www.ultimatepp.org/forums/index.php?t=msg&th=11011&goto=53752&#msg_53752
[Updated on: Sun, 24 May 2020 10:22] by Moderator

Re: BufferPainter::Clear() optimization [message #53757 is a reply to message #53751]
Tue, 28 April 2020 10:12
mirek
Messages: 14257 Registered: November 2005
Ultimate Member
Tom1 wrote on Mon, 27 April 2020 19:19:
BufferPainter::Clear(RGBA) speed is improved by over 30 % with the following change in Painter/Render.cpp [...]
This may be significant in some usage scenarios as it can currently take e.g. 4.5 milliseconds to clear a 4K ImageBuffer before drawing to it. This can now be reduced to 2.8 milliseconds.
Now this is really interesting. Fill for RGBA* is actually one that is optimized for filling huge blocks. I will need to do some benchmarks...

Re: BufferPainter::Clear() optimization [message #53758 is a reply to message #53757]
Tue, 28 April 2020 10:20
mirek
Current assembler code of Fill for RGBA *:
4000EEE0 cmp r8d,byte +0x10
4000EEE4 jl 0x14000ef13
4000EEE6 movd xmm0,edx
4000EEEA pshufd xmm0,xmm0,0x0
4000EEEF nop
4000EEF0 mov eax,r8d
4000EEF3 movdqu [rcx],xmm0
4000EEF7 movdqu [rcx+0x10],xmm0
4000EEFC movdqu [rcx+0x20],xmm0
4000EF01 movdqu [rcx+0x30],xmm0
4000EF06 add rcx,byte +0x40
4000EF0A lea r8d,[rax-0x10]
4000EF0E cmp eax,byte +0x1f
4000EF11 jg 0x14000eef0
4000EF13 add r8d,byte -0x1
4000EF17 cmp r8d,byte +0xe
4000EF1B ja 0x14000ef59
4000EF1D lea r9,[rel 0x4000ef5c]
4000EF24 movsxd rax,dword [r9+r8*4]
4000EF28 add rax,r9
4000EF2B jmp rax
4000EF2D mov [rcx+0x38],edx
4000EF30 mov [rcx+0x34],edx
4000EF33 mov [rcx+0x30],edx
4000EF36 mov [rcx+0x2c],edx
4000EF39 mov [rcx+0x28],edx
4000EF3C mov [rcx+0x24],edx
4000EF3F mov [rcx+0x20],edx
4000EF42 mov [rcx+0x1c],edx
4000EF45 mov [rcx+0x18],edx
4000EF48 mov [rcx+0x14],edx
4000EF4B mov [rcx+0x10],edx
4000EF4E mov [rcx+0xc],edx
4000EF51 mov [rcx+0x8],edx
4000EF54 mov [rcx+0x4],edx
4000EF57 mov [rcx],edx
4000EF59 ret
and the central snippet from the memsetd variant....
40001565 movaps xmm0,[rel 0x402c60a0]
4000156C nop dword [rax+0x0]
40001570 movups [rsi+rdx*4],xmm0
40001574 movups [rsi+rdx*4+0x10],xmm0
40001579 movups [rsi+rdx*4+0x20],xmm0
4000157E movups [rsi+rdx*4+0x30],xmm0
40001583 movups [rsi+rdx*4+0x40],xmm0
40001588 movups [rsi+rdx*4+0x50],xmm0
4000158D movups [rsi+rdx*4+0x60],xmm0
40001592 movups [rsi+rdx*4+0x70],xmm0
40001597 movups [rsi+rdx*4+0x80],xmm0
4000159F movups [rsi+rdx*4+0x90],xmm0
400015A7 movups [rsi+rdx*4+0xa0],xmm0
400015AF movups [rsi+rdx*4+0xb0],xmm0
400015B7 movups [rsi+rdx*4+0xc0],xmm0
400015BF movups [rsi+rdx*4+0xd0],xmm0
400015C7 movups [rsi+rdx*4+0xe0],xmm0
400015CF movups [rsi+rdx*4+0xf0],xmm0
400015D7 add rdx,byte +0x40
400015DB add rdi,byte +0x8
400015DF jnz 0x140001570
Interesting...
Benchmarking code
#include <CtrlLib/CtrlLib.h>

using namespace Upp;

GUI_APP_MAIN
{
    Color c = Red();
    int len = 4000 * 2000; // 8 million pixels
    Buffer<RGBA> b(len);
    for(int i = 0; i < 1000; i++) {
        {
            RTIMING("memsetd");
            memsetd(b, *(dword*)&(c), len);
        }
        {
            RTIMING("Fill");
            Fill(b, c, len);
        }
    }
}
CLANGx64, 2700x
TIMING Fill : 2.73 s - 2.73 ms ( 2.73 s / 1000 ), min: 2.00 ms, max: 4.00 ms, nesting: 0 - 1000
TIMING memsetd : 2.78 s - 2.78 ms ( 2.78 s / 1000 ), min: 2.00 ms, max: 5.00 ms, nesting: 0 - 1000
MSBT19x64
TIMING Fill : 2.89 s - 2.89 ms ( 2.89 s / 1000 ), min: 2.00 ms, max: 5.00 ms, nesting: 0 - 1000
TIMING memsetd : 2.90 s - 2.90 ms ( 2.90 s / 1000 ), min: 2.00 ms, max: 5.00 ms, nesting: 0 - 1000
[Updated on: Tue, 28 April 2020 10:31]

Re: BufferPainter::Clear() optimization [message #53760 is a reply to message #53757]
Tue, 28 April 2020 10:27
Tom1
Hi,
Benchmarking and tuning is exactly what I did through yesterday (and beyond). I worked with both CLANGx64 and MSBT19x64. I worked out a bunch of optimized fillers until it turned out that memsetd() wins easily on large blocks and mostly on smaller blocks too. Especially on MSBT19x64 there does not seem to be a way to beat memsetd(). On CLANGx64 small transfer of one or two items was slightly faster, but on larger blocks memsetd() won again. Interestingly, CLANGx64 was a lot faster than MSBT19x64 for any of my own block transfer attempts, but still could not beat memsetd().
Best regards,
Tom

Re: BufferPainter::Clear() optimization [message #53764 is a reply to message #53761]
Tue, 28 April 2020 11:17
Tom1
Hi,
Yes, the CPU is likely the major player here. I took the liberty of modifying TimingInspector to get finer timing granularity using usecs(). The modified test case can be found below. I get the following results on my Core i7 + Windows 10 Professional x64. Now we can focus on the best round ('min:') to better exclude the effect of other tasks. As you can see, memsetd on MSBT19x64 is quite an amazing performer.
MSBT19x64, Intel Core i7:
TIMING memsetd : 1.45 s - 1.45 ms ( 1.45 s / 1000 ), min: 1.15 ms, max: 5.25 ms, nesting: 0 - 1000
TIMING Fill : 3.73 s - 3.73 ms ( 3.73 s / 1000 ), min: 3.25 ms, max: 9.92 ms, nesting: 0 - 1000
CLANGx64, Intel Core i7:
TIMING memsetd : 3.85 s - 3.85 ms ( 3.85 s / 1000 ), min: 3.35 ms, max: 10.36 ms, nesting: 0 - 1000
TIMING Fill : 3.87 s - 3.87 ms ( 3.87 s / 1000 ), min: 3.38 ms, max: 11.33 ms, nesting: 0 - 1000
I guess that in my larger program the optimizations did not work this well, as Fill would have taken around 5 ms for a buffer of this size.
Anyway here's the modified benchmark.
#include <CtrlLib/CtrlLib.h>

using namespace Upp;

// Variant of TimingInspector that measures with usecs() (microseconds)
class UTimingInspector {
protected:
    static bool active;
    const char *name;
    int         call_count;
    int64       total_time;
    int64       min_time;
    int64       max_time;
    int         max_nesting;
    int         all_count;
    StaticMutex mutex;

public:
    UTimingInspector(const char *name = NULL); // Not String !!!
    ~UTimingInspector();

    void   Add(int64 start_time, int nesting); // int64, so the usecs() value is not truncated
    String Dump();

    class Routine {
    public:
        Routine(UTimingInspector& stat, int& nesting)
        : nesting(nesting), stat(stat) {
            start_time = usecs();
            nesting++;
        }
        ~Routine() {
            nesting--;
            stat.Add(start_time, nesting);
        }
    protected:
        int64             start_time;
        int&              nesting;
        UTimingInspector& stat;
    };

    static void Activate(bool b) { active = b; }
};

bool UTimingInspector::active = true;

static UTimingInspector s_zero; // time of Start / End without actual body to measure

UTimingInspector::UTimingInspector(const char *_name) {
    name = _name ? _name : "";
    all_count = call_count = max_nesting = min_time = max_time = total_time = 0;
    static bool init;
    if(!init) {
#if defined(PLATFORM_WIN32) && !defined(PLATFORM_WINCE)
        timeBeginPeriod(1);
#endif
        init = true;
    }
}

UTimingInspector::~UTimingInspector() {
    if(this == &s_zero) return;
    Mutex::Lock __(mutex);
    StdLog() << Dump() << "\r\n";
}

void UTimingInspector::Add(int64 start_time, int nesting)
{
    int64 time = usecs() - start_time; // elapsed microseconds
    Mutex::Lock __(mutex);
    if(!active) return;
    all_count++;
    if(nesting > max_nesting)
        max_nesting = nesting;
    if(nesting == 0) {
        total_time += time;
        if(call_count++ == 0)
            min_time = max_time = time;
        else {
            if(time < min_time)
                min_time = time;
            if(time > max_time)
                max_time = time;
        }
    }
}

String UTimingInspector::Dump() {
    Mutex::Lock __(mutex);
    String s = Sprintf("TIMING %-15s: ", name);
    if(call_count == 0)
        return s + "No active hit";
    ONCELOCK {
        int w = GetTickCount();
        while(GetTickCount() - w < 200) { // measure profiling overhead
            thread_local int nesting = 0;
            UTimingInspector::Routine __(s_zero, nesting);
        }
    }
    // average time per call in seconds, minus the measured profiling overhead
    double tm = max(0.0, double(total_time) / call_count / 1000000 -
                         double(s_zero.total_time) / s_zero.call_count / 1000000);
    return s
           + timeFormat(tm * call_count)
           + " - " + timeFormat(tm)
           + " (" + timeFormat((double)total_time / 1000000) + " / "
           + Sprintf("%d )", call_count)
           + ", min: " + timeFormat((double)min_time / 1000000)
           + ", max: " + timeFormat((double)max_time / 1000000)
           + Sprintf(", nesting: %d - %d", max_nesting, all_count);
}

#define RUTIMING(x) \
    static UTimingInspector COMBINE(sTmStat, __LINE__)(x); \
    static thread_local int COMBINE(sTmStatNesting, __LINE__); \
    UTimingInspector::Routine COMBINE(sTmStatR, __LINE__)(COMBINE(sTmStat, __LINE__), COMBINE(sTmStatNesting, __LINE__))

GUI_APP_MAIN
{
    Color c = Red();
    int len = 4000 * 2000;
    Buffer<RGBA> b(len);
    for(int i = 0; i < 1000; i++) {
        {
            RUTIMING("Fill");
            Fill(b, c, len);
        }
        {
            RUTIMING("memsetd");
            memsetd(b, *(dword*)&(c), len);
        }
    }
}
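The per-round bookkeeping (total, best round, worst round) can also be sketched without U++, which may help when reproducing the numbers elsewhere. This is a minimal sketch under assumptions: `RoundStats` and `TimeRounds` are hypothetical names, and std::chrono::steady_clock stands in for usecs().

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <limits>

// Hypothetical helper: accumulates per-round durations in microseconds.
struct RoundStats {
    std::int64_t total_us = 0;
    std::int64_t min_us = std::numeric_limits<std::int64_t>::max();
    std::int64_t max_us = 0;
    int          rounds = 0;

    void Add(std::int64_t us)
    {
        total_us += us;
        min_us = std::min(min_us, us);
        max_us = std::max(max_us, us);
        rounds++;
    }
};

// Run `fn` `rounds` times and time every round with a monotonic clock.
template <class Fn>
RoundStats TimeRounds(Fn fn, int rounds)
{
    assert(rounds > 0);
    RoundStats st;
    for(int i = 0; i < rounds; i++) {
        auto t0 = std::chrono::steady_clock::now();
        fn();
        auto t1 = std::chrono::steady_clock::now();
        st.Add(std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count());
    }
    return st;
}
```

Comparing `min_us` across variants, as suggested above, reduces the influence of rounds that were disturbed by other tasks.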
Best regards,
Tom

Re: BufferPainter::Clear() optimization [message #53765 is a reply to message #53763]
Tue, 28 April 2020 11:27
Oblivion
Messages: 1206 Registered: August 2007
Senior Contributor
Hello,
A quick test on an older AMD FX 6100, a six-core processor at 3.2 GHz (naturally, it is slower):
// GCC (x64, latest ver.)
TIMING Fill : 7,53 s - 7,53 ms ( 7,53 s / 1000 ), min: 7,00 ms, max: 9,00 ms, nesting: 0 - 1000
TIMING memsetd : 6,31 s - 6,31 ms ( 6,31 s / 1000 ), min: 6,00 ms, max: 18,00 ms, nesting: 0 - 1000
----
// CLANG(x64, latest ver.)
TIMING Fill : 7,07 s - 7,07 ms ( 7,07 s / 1000 ), min: 6,00 ms, max: 10,00 ms, nesting: 0 - 1000
TIMING memsetd : 7,07 s - 7,07 ms ( 7,08 s / 1000 ), min: 6,00 ms, max: 17,00 ms, nesting: 0 - 1000
-----
Best regards,
Oblivion
Github page: https://github.com/ismail-yilmaz
upp-components: https://github.com/ismail-yilmaz/upp-components
Bobcat the terminal emulator: https://github.com/ismail-yilmaz/Bobcat

Re: BufferPainter::Clear() optimization [message #53913 is a reply to message #53751]
Fri, 15 May 2020 09:04
mirek
Experimenting with parallel:
#include <CtrlLib/CtrlLib.h>

using namespace Upp;

void CoFill(RGBA *t, RGBA c, int len)
{
    const int CHUNK = 1024;
    std::atomic<int> ii(0); // index of the next chunk to be claimed
    CoDo([&] {
        for(;;) {
            int pos = CHUNK * ii++;
            if(pos >= len)
                break;
            Fill(t + pos, c, min(CHUNK, len - pos));
        }
    });
}

GUI_APP_MAIN
{
    Color c = Red();
    int len = 4000 * 2000;
    Buffer<RGBA> b(len);
    for(int i = 0; i < 10; i++) {
        {
            RTIMING("memsetd");
            memsetd(b, *(dword*)&(c), len);
        }
        {
            RTIMING("Fill");
            Fill(b, c, len);
        }
        {
            RTIMING("CoFill");
            CoFill(b, c, len);
        }
    }
}
TIMING CoFill : 19.00 ms - 1.90 ms (19.00 ms / 10 ), min: 1.00 ms, max: 3.00 ms, nesting: 0 - 10
TIMING Fill : 31.00 ms - 3.10 ms (31.00 ms / 10 ), min: 3.00 ms, max: 4.00 ms, nesting: 0 - 10
TIMING memsetd : 30.00 ms - 3.00 ms (30.00 ms / 10 ), min: 2.00 ms, max: 5.00 ms, nesting: 0 - 10
To try that on a different CPU, here are Raspberry Pi 4 numbers:
TIMING CoFill : 145.00 ms - 14.50 ms (145.00 ms / 10 ), min: 14.00 ms, max: 15.00 ms, nesting: 0 - 10
TIMING Fill : 225.00 ms - 22.50 ms (225.00 ms / 10 ), min: 22.00 ms, max: 24.00 ms, nesting: 0 - 10
TIMING memsetd : 184.00 ms - 18.40 ms (184.00 ms / 10 ), min: 11.00 ms, max: 77.00 ms, nesting: 0 - 10
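The chunk-claiming pattern used by CoFill can be tried outside U++ as well. The following is only a sketch under assumptions: `ParallelFill` is a hypothetical name, plain std::thread workers replace CoDo(), and 32-bit words stand in for RGBA pixels.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Fill dst[0..len) using `workers` threads that claim fixed-size chunks
// from a shared atomic counter.
inline void ParallelFill(std::uint32_t *dst, std::uint32_t value, int len, int workers)
{
    assert(len >= 0 && workers > 0);
    const int CHUNK = 1024;
    std::atomic<int> next(0);
    auto work = [&] {
        for(;;) {
            // 64-bit position, so the multiplication cannot overflow an int
            std::int64_t pos = std::int64_t(CHUNK) * next.fetch_add(1);
            if(pos >= len)
                break;
            int n = static_cast<int>(std::min<std::int64_t>(CHUNK, len - pos));
            std::fill_n(dst + pos, n, value);
        }
    };
    std::vector<std::thread> pool;
    for(int i = 1; i < workers; i++)
        pool.emplace_back(work);
    work(); // the calling thread takes part as well
    for(std::thread& t : pool)
        t.join();
}
```

Starting fresh threads per call is far more expensive than reusing a thread pool, so this sketch shows the work distribution only and says nothing about achievable speed.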
[Updated on: Fri, 15 May 2020 10:18]

Re: BufferPainter::Clear() optimization [message #53915 is a reply to message #53914]
Fri, 15 May 2020 11:33
mirek
Tom1 wrote on Fri, 15 May 2020 10:18
While interesting, I found that a plain memset() is way faster than memsetd() or Fill(). Just filling with 0xff (as the RGBA is for white) you will get a superior speed. I currently use memset() for a clear white on an ImageBuffer before giving it to BufferPainter. For more complex fill colors, I guess, the apex_memmove / memcpy code could be investigated for a more optimal result. (I posted a link to the apex code here on the forum briefly before the release of 2020.1.)
Best regards,
Tom
With CLANG, memset performance is about the same. However, with MSVC, it really is pretty damn fast.
I have dug into the code and the key ingredient seems to be the MOVNTPS instruction, which means the code could be easily adapted to setting dwords. I just need to understand the MT implications mentioned here:
https://www.felixcloutier.com/x86/movntps
It also might be questionable how this will affect the performance down the road (data not being in cache and everything...)
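As an illustration of the multithreading concern: non-temporal stores are weakly ordered, so a store fence is needed before another thread is told that the buffer is ready. Below is a minimal sketch under assumptions: `StreamFillAndPublish` and the `HAVE_SSE2_STREAM` macro are hypothetical names, and targets without SSE2 simply fall back to a plain loop.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__SSE2__) || defined(_M_X64) || defined(_M_AMD64)
#include <emmintrin.h> // _mm_set1_epi32, _mm_stream_si128, _mm_sfence
#define HAVE_SSE2_STREAM 1
#endif

// Fill `count` 32-bit words at a 16-byte aligned address, then raise `ready`
// so that another thread may start reading the buffer.
inline void StreamFillAndPublish(std::uint32_t *dst, std::uint32_t value,
                                 std::size_t count, std::atomic<bool>& ready)
{
    assert((reinterpret_cast<std::uintptr_t>(dst) & 15) == 0);
    std::size_t i = 0;
#ifdef HAVE_SSE2_STREAM
    const __m128i v = _mm_set1_epi32(static_cast<int>(value));
    for(; i + 4 <= count; i += 4)
        _mm_stream_si128(reinterpret_cast<__m128i *>(dst + i), v);
    _mm_sfence(); // order the non-temporal stores before the flag store below
#endif
    for(; i < count; i++) // scalar tail, or the whole range without SSE2
        dst[i] = value;
    ready.store(true, std::memory_order_release);
}
```

A consumer would wait for `ready.load(std::memory_order_acquire)` to return true before reading. The cache question raised above is separate: streamed data bypasses the cache, so code that reads the buffer right after the fill may pay for that.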
Mirek

Re: BufferPainter::Clear() optimization [message #53916 is a reply to message #53915]
Fri, 15 May 2020 11:41
Tom1
At the time I was testing with the memset -- if I remember correctly -- on Windows + CLANG the memset with zero value was very efficient too, but the rest of the set values were slower. So, there must be some special optimized implementation for zeroing memory on CLANG too.
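Since memset() writes a single byte value everywhere, it can only reproduce colors whose four channel bytes are identical, such as transparent black or opaque white. A small sketch of that condition follows, with hypothetical names (`Pixel`, `IsByteUniform`, `FillPixels`) that are not part of U++:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for U++'s RGBA.
struct Pixel {
    std::uint8_t b, g, r, a;
};

// True when all four channel bytes are equal, e.g. 0,0,0,0 or 255,255,255,255.
inline bool IsByteUniform(Pixel c)
{
    return c.b == c.g && c.g == c.r && c.r == c.a;
}

// Use memset when the color allows it, otherwise write pixel by pixel.
inline void FillPixels(Pixel *dst, Pixel c, std::size_t count)
{
    assert(dst != nullptr || count == 0);
    if(count == 0)
        return;
    if(IsByteUniform(c)) {
        std::memset(dst, c.b, count * sizeof(Pixel));
        return;
    }
    for(std::size_t i = 0; i < count; i++)
        dst[i] = c;
}
```

Whether the memset() branch is actually faster depends on the compiler and runtime library, as the observations in this thread show.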
BR, Tom

Re: BufferPainter::Clear() optimization [message #53917 is a reply to message #53915]
Fri, 15 May 2020 11:47
mirek
Here we go:
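#include <smmintrin.h> // SSE intrinsics: _mm_loadu_pd, _mm_stream_pd, _mm_sfence

void SSEFill2(RGBA *t, RGBA c, int len)
{
    if(len >= 512) {
        while((uintptr_t)t & 63) { // align to cache line
            *t++ = c;
            len--;
        }
        dword m[4];
        m[0] = m[1] = m[2] = m[3] = *(dword*)&(c);
        __m128d val = _mm_loadu_pd((double *)m); // four copies of the pixel
        while(len >= 16) { // 16 pixels = one 64-byte cache line per iteration
            _mm_stream_pd((double *)t, val);
            _mm_stream_pd((double *)(t + 4), val);
            _mm_stream_pd((double *)(t + 8), val);
            _mm_stream_pd((double *)(t + 12), val);
            t += 16;
            len -= 16;
        }
        _mm_sfence(); // order the non-temporal stores before returning
    }
    Fill(t, c, len); // remaining pixels, or the whole fill when len < 512
}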
TIMING CoFill : 42.00 ms - 2.10 ms (42.00 ms / 20 ), min: 1.00 ms, max: 3.00 ms, nesting: 0 - 20
TIMING SSEFill2 : 16.00 ms - 799.98 us (16.00 ms / 20 ), min: 0.00 ns, max: 1.00 ms, nesting: 0 - 20
TIMING SSEFill : 55.00 ms - 2.75 ms (55.00 ms / 20 ), min: 2.00 ms, max: 3.00 ms, nesting: 0 - 20
TIMING Fill : 56.00 ms - 2.80 ms (56.00 ms / 20 ), min: 2.00 ms, max: 3.00 ms, nesting: 0 - 20
TIMING memsetd : 52.00 ms - 2.60 ms (52.00 ms / 20 ), min: 2.00 ms, max: 3.00 ms, nesting: 0 - 20

Re: BufferPainter::Clear() optimization [message #53918 is a reply to message #53917]
Fri, 15 May 2020 12:08
Tom1
And we have a winner!!
Also, please take a look at MSBT19 and MSBT19x64 for this too. It looks like this code only works with CLANG and CLANGx64 on Windows. (Have not checked on Linux yet.)
Additionally, plain memset, memsets and memsetd -variants would be useful for various tasks, as their efficiency varies depending on the compiler.
Thanks and best regards,
Tom
EDIT: I mean it does not compile on MSBT...
[Updated on: Fri, 15 May 2020 12:09]

Re: BufferPainter::Clear() optimization [message #53919 is a reply to message #53751]
Fri, 15 May 2020 12:16
Oblivion
On Linux with the relatively old AMD Athlon FX 6100.
Works with both GCC (9.3) and CLANG (10.0). Requires #include <smmintrin.h>:
GCC:
TIMING SSEFill2 : 43,99 ms - 4,40 ms (44,00 ms / 10 ), min: 4,00 ms, max: 5,00 ms, nesting: 0 - 10
TIMING CoFill : 55,99 ms - 5,60 ms (56,00 ms / 10 ), min: 5,00 ms, max: 6,00 ms, nesting: 0 - 10
TIMING Fill : 75,99 ms - 7,60 ms (76,00 ms / 10 ), min: 7,00 ms, max: 8,00 ms, nesting: 0 - 10
TIMING memsetd : 66,99 ms - 6,70 ms (67,00 ms / 10 ), min: 5,00 ms, max: 17,00 ms, nesting: 0 - 10
CLANG:
TIMING SSEFill2 : 45,99 ms - 4,60 ms (46,00 ms / 10 ), min: 4,00 ms, max: 7,00 ms, nesting: 0 - 10
TIMING CoFill : 55,99 ms - 5,60 ms (56,00 ms / 10 ), min: 5,00 ms, max: 6,00 ms, nesting: 0 - 10
TIMING Fill : 65,99 ms - 6,60 ms (66,00 ms / 10 ), min: 6,00 ms, max: 10,00 ms, nesting: 0 - 10
TIMING memsetd : 78,99 ms - 7,90 ms (79,00 ms / 10 ), min: 5,00 ms, max: 23,00 ms, nesting: 0 - 10
[Updated on: Fri, 15 May 2020 12:27]

Re: BufferPainter::Clear() optimization [message #53920 is a reply to message #53919]
Fri, 15 May 2020 12:28
Tom1
Hi,
Thanks Oblivion; the #include <smmintrin.h> was exactly what was needed on Windows + CLANG too...
Here are the results for the 4k RGBA fill on Windows 10 x64 on Core i9:
MSBT19:
TIMING SSEFill2 : 1.30 s - 1.30 ms ( 1.30 s / 1000 ), min: 1.03 ms, max: 1.99 ms, nesting: 0 - 1000
TIMING Fill : 1.13 s - 1.13 ms ( 1.13 s / 1000 ), min: 841.00 us, max: 3.04 ms, nesting: 0 - 1000
MSBT19x64:
TIMING SSEFill2 : 906.90 ms - 906.90 us (906.93 ms / 1000 ), min: 846.00 us, max: 1.67 ms, nesting: 0 - 1000
TIMING Fill : 2.34 s - 2.34 ms ( 2.34 s / 1000 ), min: 2.21 ms, max: 4.69 ms, nesting: 0 - 1000
CLANG:
TIMING SSEFill2 : 935.97 ms - 935.97 us (936.02 ms / 1000 ), min: 854.00 us, max: 1.67 ms, nesting: 0 - 1000
TIMING Fill : 2.44 s - 2.44 ms ( 2.44 s / 1000 ), min: 2.25 ms, max: 4.74 ms, nesting: 0 - 1000
CLANGx64:
TIMING SSEFill2 : 934.45 ms - 934.45 us (934.47 ms / 1000 ), min: 854.00 us, max: 1.77 ms, nesting: 0 - 1000
TIMING Fill : 2.20 s - 2.20 ms ( 2.20 s / 1000 ), min: 1.98 ms, max: 5.97 ms, nesting: 0 - 1000
Looks very good indeed! MSBT19 on the other hand looks surprising...
Best regards,
Tom

Re: BufferPainter::Clear() optimization [message #53922 is a reply to message #53918]
Fri, 15 May 2020 13:15
mirek
Tom1 wrote on Fri, 15 May 2020 12:08
Additionally, plain memset, memsets and memsetd -variants would be useful for various tasks, as their efficiency varies depending on the compiler.
What about this:
#include <smmintrin.h> // SSE intrinsics: _mm_loadu_pd, _mm_stream_pd, _mm_sfence

// Writes `count` 64-byte cache lines, each made of four copies of the 16 bytes at data16
void FillCacheLines(void *cache_aligned_ptr, void *data16, int count)
{
    dword *t = (dword *)cache_aligned_ptr;
    __m128d val = _mm_loadu_pd((double *)data16);
    dword *e = t + 16 * count;
    while(t < e) {
        _mm_stream_pd((double *)t, val);
        _mm_stream_pd((double *)(t + 4), val);
        _mm_stream_pd((double *)(t + 8), val);
        _mm_stream_pd((double *)(t + 12), val);
        t += 16;
    }
    _mm_sfence();
}

template <class T>
void MemSet(void *dest, T data, int len)
{
    static_assert(sizeof(T) == 1 || sizeof(T) == 2 || sizeof(T) == 4 || sizeof(T) == 8 || sizeof(T) == 16, "invalid sizeof");
    T *t = (T *)dest;
    if(len * sizeof(T) > 550) {
        while((uintptr_t)t & 63) { // align to cache line
            *t++ = data;
            len--;
        }
        const int itemn = 16 / sizeof(T);       // items per 16-byte SSE register
        const int per_cache_line = 4 * itemn;   // items per 64-byte cache line
        T m[itemn];
        for(int i = 0; i < itemn; i++)
            m[i] = data;
        int count = len / per_cache_line;
        FillCacheLines(t, m, count);
        t += per_cache_line * count; // skip the part FillCacheLines has written
        len -= per_cache_line * count;
    }
    while(len >= 16) {
        t[0] = data; t[1] = data; t[2] = data; t[3] = data;
        t[4] = data; t[5] = data; t[6] = data; t[7] = data;
        t[8] = data; t[9] = data; t[10] = data; t[11] = data;
        t[12] = data; t[13] = data; t[14] = data; t[15] = data;
        t += 16;
        len -= 16;
    }
    switch(len) { // remaining 0..15 items; every case falls through on purpose
    case 15: t[14] = data;
    case 14: t[13] = data;
    case 13: t[12] = data;
    case 12: t[11] = data;
    case 11: t[10] = data;
    case 10: t[9] = data;
    case 9: t[8] = data;
    case 8: t[7] = data;
    case 7: t[6] = data;
    case 6: t[5] = data;
    case 5: t[4] = data;
    case 4: t[3] = data;
    case 3: t[2] = data;
    case 2: t[1] = data;
    case 1: t[0] = data;
    }
}
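The bookkeeping in MemSet() splits a long fill into three parts: a head written item by item up to the next 64-byte boundary, a run of whole cache lines, and a tail. The arithmetic can be checked in isolation with a small sketch. `FillPlan` and `PlanFill` are hypothetical names, and the sketch assumes the start address is a multiple of the item size (otherwise the alignment loop in MemSet() would never reach a cache-line boundary).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper describing how a fill of `len` items is split up.
struct FillPlan {
    std::size_t head;        // items written one by one until 64-byte alignment
    std::size_t cache_lines; // whole 64-byte lines written with streaming stores
    std::size_t tail;        // items left for the unrolled loop and the switch
};

// Mirrors the size arithmetic of MemSet for items of `item_size` bytes.
inline FillPlan PlanFill(std::uintptr_t addr, std::size_t len, std::size_t item_size)
{
    assert(item_size != 0 && 64 % item_size == 0 && addr % item_size == 0);
    FillPlan p{0, 0, len};
    if(len * item_size > 550) { // same threshold as in MemSet
        p.head = ((64 - addr % 64) % 64) / item_size;
        std::size_t rest = len - p.head;
        std::size_t per_cache_line = 64 / item_size;
        p.cache_lines = rest / per_cache_line;
        p.tail = rest - p.cache_lines * per_cache_line;
    }
    return p;
}
```

For example, 1000 four-byte items starting 16 bytes past a cache-line boundary give a head of 12 items, 61 full cache lines and a tail of 12 items.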

Re: BufferPainter::Clear() optimization [message #53923 is a reply to message #53922]
Fri, 15 May 2020 13:36
Tom1
Mirek,
Yes, absolutely beautiful!
The results for the set including the new MemSet() on Win10x64 on Core i9 are:
MSBT19:
TIMING MemSet : 831.06 ms - 831.06 us (831.13 ms / 1000 ), min: 779.00 us, max: 1.72 ms, nesting: 0 - 1000
TIMING SSEFill2 : 1.21 s - 1.21 ms ( 1.21 s / 1000 ), min: 1.00 ms, max: 2.19 ms, nesting: 0 - 1000
TIMING Fill : 915.70 ms - 915.70 us (915.76 ms / 1000 ), min: 859.00 us, max: 3.49 ms, nesting: 0 - 1000
MSBT19x64:
TIMING MemSet : 818.33 ms - 818.33 us (818.36 ms / 1000 ), min: 777.00 us, max: 1.71 ms, nesting: 0 - 1000
TIMING SSEFill2 : 899.74 ms - 899.74 us (899.77 ms / 1000 ), min: 854.00 us, max: 1.78 ms, nesting: 0 - 1000
TIMING Fill : 2.29 s - 2.29 ms ( 2.29 s / 1000 ), min: 2.21 ms, max: 4.51 ms, nesting: 0 - 1000
CLANG:
TIMING MemSet : 835.39 ms - 835.39 us (835.45 ms / 1000 ), min: 790.00 us, max: 1.51 ms, nesting: 0 - 1000
TIMING SSEFill2 : 918.63 ms - 918.63 us (918.68 ms / 1000 ), min: 872.00 us, max: 1.47 ms, nesting: 0 - 1000
TIMING Fill : 2.36 s - 2.36 ms ( 2.36 s / 1000 ), min: 2.28 ms, max: 5.45 ms, nesting: 0 - 1000
CLANGx64:
TIMING MemSet : 838.86 ms - 838.86 us (838.89 ms / 1000 ), min: 787.00 us, max: 1.70 ms, nesting: 0 - 1000
TIMING SSEFill2 : 921.49 ms - 921.49 us (921.51 ms / 1000 ), min: 870.00 us, max: 1.84 ms, nesting: 0 - 1000
TIMING Fill : 2.10 s - 2.10 ms ( 2.10 s / 1000 ), min: 2.01 ms, max: 5.00 ms, nesting: 0 - 1000
I trust you can now make all the different fillers through U++ to use this new code... right?
Thanks and best regards,
Tom

Re: BufferPainter::Clear() optimization [message #53934 is a reply to message #53923]
Fri, 15 May 2020 23:13
Tom1
Hi Mirek,
The game is not over yet, I'm afraid. I did some additional benchmarking with varying buffer lengths to set. It gets more complicated...
RGBA c = Red();
int bsize = 8 * 1024 * 1024;
Buffer<RGBA> b(bsize, (RGBA)Blue());
String result = "\"N\",\"Fill()\",\"memsetd()\",\"MemSet()\"\r\n";
for(int len = 1; len <= bsize; len *= 2) {
    int maximum = 1000000000 / len; // fewer repetitions for longer fills
    int64 t0 = usecs();
    for(int i = 0; i < maximum; i++) Fill(~b, c, len);
    int64 t1 = usecs(t0); // elapsed microseconds
    t0 = usecs();
    for(int i = 0; i < maximum; i++) memsetd(~b, *(dword*)&(c), len);
    int64 t2 = usecs(t0);
    t0 = usecs();
    for(int i = 0; i < maximum; i++) MemSet(~b, c, len);
    int64 t3 = usecs(t0);
    // 1000.0 * microseconds / repetitions = nanoseconds per call
    result.Cat(Format("%d,%f,%f,%f\r\n", len, 1000.0 * t1 / maximum, 1000.0 * t2 / maximum, 1000.0 * t3 / maximum));
}
SaveFile(GetHomeDirFile("Desktop/memset.csv"), result);
Now, if you import the resulting memset.csv to your spreadsheet program and create a log-log plot, you will see that the different buffer lengths have a huge impact on the performance of each algorithm. As filling lengths can be quite diverse, I think we need to think about some combination of the different algorithms. Additionally, we need to look at the results on different CPUs. I will keep tinkering on this one for a while here.
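One possible shape for such a combination is a dispatcher that picks a routine by fill length. The sketch below is only an illustration under assumptions: all names are hypothetical, the 64 KB cut-over is an arbitrary placeholder rather than a measured value, and std::fill_n merely stands in for a streaming-store routine.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

enum class FillPath { Unrolled, Bulk };

// Hypothetical cut-over point in bytes. A real value would have to come from
// per-compiler, per-CPU measurements such as the CSV produced above.
constexpr std::size_t kBulkThresholdBytes = 64 * 1024;

inline FillPath ChooseFillPath(std::size_t count)
{
    return count * sizeof(std::uint32_t) >= kBulkThresholdBytes ? FillPath::Bulk
                                                                : FillPath::Unrolled;
}

// Plain loop, standing in for the variant that wins on short fills.
inline void ShortFill(std::uint32_t *dst, std::uint32_t v, std::size_t count)
{
    for(std::size_t i = 0; i < count; i++)
        dst[i] = v;
}

// Placeholder for the variant that wins on long fills (e.g. streaming stores).
inline void LongFill(std::uint32_t *dst, std::uint32_t v, std::size_t count)
{
    std::fill_n(dst, count, v);
}

// Fill `count` words and report which path was taken.
inline FillPath DispatchFill(std::uint32_t *dst, std::uint32_t v, std::size_t count)
{
    assert(dst != nullptr || count == 0);
    FillPath path = ChooseFillPath(count);
    if(path == FillPath::Bulk)
        LongFill(dst, v, count);
    else
        ShortFill(dst, v, count);
    return path;
}
```

The extra branch costs little compared with the fill itself, but the thresholds would probably need to differ between compilers and CPUs, which is why the measurements on several machines matter.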
(Now I'm running on Core i7 here at home, so this one I can test easily, and also the Core i9 at the office every now and then, as the situation is what it is...)
Best regards,
Tom

Re: BufferPainter::Clear() optimization [message #53935 is a reply to message #53923]
Fri, 15 May 2020 23:45
Didier
Messages: 726 Registered: November 2008 Location: France
Contributor
Here is what I get on my Linux machine with a Ryzen 2700.
Due to unstable results with 10 loops, I also included results for 1000 loops.
The new MemSet() is definitely a really good addition.
==== CLANG X64 ====
TIMING MemSet : 10.00 ms - 999.98 us (10.00 ms / 10 ), min: 1.00 ms, max: 1.00 ms, nesting: 0 - 10
TIMING SSEFill2 : 12.00 ms - 1.20 ms (12.00 ms / 10 ), min: 1.00 ms, max: 2.00 ms, nesting: 0 - 10
TIMING CoFill : 21.00 ms - 2.10 ms (21.00 ms / 10 ), min: 2.00 ms, max: 3.00 ms, nesting: 0 - 10
TIMING Fill : 30.00 ms - 3.00 ms (30.00 ms / 10 ), min: 3.00 ms, max: 3.00 ms, nesting: 0 - 10
TIMING memsetd : 30.00 ms - 3.00 ms (30.00 ms / 10 ), min: 2.00 ms, max: 9.00 ms, nesting: 0 - 10
TIMING MemSet : 833.97 ms - 833.97 us (834.00 ms / 1000 ), min: 0.00 ns, max: 2.00 ms, nesting: 0 - 1000
TIMING SSEFill2 : 870.97 ms - 870.97 us (871.00 ms / 1000 ), min: 0.00 ns, max: 2.00 ms, nesting: 0 - 1000
TIMING CoFill : 1.88 s - 1.88 ms ( 1.88 s / 1000 ), min: 1.00 ms, max: 3.00 ms, nesting: 0 - 1000
TIMING Fill : 2.90 s - 2.90 ms ( 2.90 s / 1000 ), min: 2.00 ms, max: 4.00 ms, nesting: 0 - 1000
TIMING memsetd : 2.51 s - 2.51 ms ( 2.51 s / 1000 ), min: 2.00 ms, max: 10.00 ms, nesting: 0 - 1000
==== GCC X64 ====
TIMING MemSet : 7.00 ms - 699.98 us ( 7.00 ms / 10 ), min: 0.00 ns, max: 1.00 ms, nesting: 0 - 10
TIMING SSEFill2 : 9.00 ms - 899.98 us ( 9.00 ms / 10 ), min: 0.00 ns, max: 1.00 ms, nesting: 0 - 10
TIMING CoFill : 23.00 ms - 2.30 ms (23.00 ms / 10 ), min: 2.00 ms, max: 3.00 ms, nesting: 0 - 10
TIMING Fill : 30.00 ms - 3.00 ms (30.00 ms / 10 ), min: 2.00 ms, max: 4.00 ms, nesting: 0 - 10
TIMING memsetd : 35.00 ms - 3.50 ms (35.00 ms / 10 ), min: 2.00 ms, max: 10.00 ms, nesting: 0 - 10
TIMING MemSet : 820.98 ms - 820.98 us (821.00 ms / 1000 ), min: 0.00 ns, max: 2.00 ms, nesting: 0 - 1000
TIMING SSEFill2 : 877.98 ms - 877.98 us (878.00 ms / 1000 ), min: 0.00 ns, max: 2.00 ms, nesting: 0 - 1000
TIMING CoFill : 1.85 s - 1.85 ms ( 1.85 s / 1000 ), min: 1.00 ms, max: 3.00 ms, nesting: 0 - 1000
TIMING Fill : 2.97 s - 2.97 ms ( 2.97 s / 1000 ), min: 2.00 ms, max: 4.00 ms, nesting: 0 - 1000
TIMING memsetd : 2.52 s - 2.52 ms ( 2.52 s / 1000 ), min: 2.00 ms, max: 8.00 ms, nesting: 0 - 1000