U++ forum: Welcome to the forum

Search on this site

Search in forums

Home » Extra libraries, Code snippets, applications etc. » C++ language problems and code snippets » Optimized memcmp for x86

Show: Today's Messages :: Show Polls :: Message Navigator
E-mail to friend

Optimized memcmp for x86 [message #14308]

Fri, 22 February 2008 11:07

mirek
Messages: 13975
Registered: November 2005

Ultimate Member

Well, this code seems to run 20% faster than intrinsic GCC memcmp on x86-64:

#ifdef COMPILER_GCC
inline dword _byteswap_ulong(dword x)
{
	asm("bswap %0" : "=r" (x) : "0" (x));
	return x;
}

inline uint64 _byteswap_uint64(uint64 x)
{
	asm("bswap %0" : "=r" (x) : "0" (x));
	return x;
}

inline word _byteswap_ushort(word x)
{
	__asm__("xchgb %b0,%h0" : "=q" (x) :  "0" (x));
	return x;
}
#endif

int MemCmp(const char *a, const char *b, size_t len)
{
	if(((size_t)a & 3) | ((size_t)b & 3))
		return memcmp(a, b, len);
	const dword *x = (dword *)a;
	const dword *y = (dword *)b;
	const dword *e = x + (len >> 2);
	while(x < e) {
		if(*x != *y)
			return int(_byteswap_ulong(*x) - _byteswap_ulong(*y));
		x++;
		y++;
	}
	if(len & 2)
		if(*(word *)x != *(word *)y)
			return int(_byteswap_ushort(*(word *)x) - _byteswap_ushort(*(word *)y));
	if(len & 1)
		return int(*((byte *)x + 2)) - int(*((byte *)y + 2));
	return 0;
}

(Obviously, when both areas are dword aligned, but that happens a lot...).

Mirek

Report message to a moderator

Re: Optimized memcmp for x86 [message #14335 is a reply to message #14308]

Fri, 22 February 2008 19:08

mr_ped
Messages: 825
Registered: November 2005
Location: Czech Republic - Praha

Experienced Contributor

luzr wrote on Fri, 22 February 2008 11:07

	if(len & 2)
		if(*(word *)x != *(word *)y)
			return int(_byteswap_ushort(*(word *)x) - _byteswap_ushort(*(word *)y));
	if(len & 1)
		return int(*((byte *)x + 2)) - int(*((byte *)y + 2));

I don't get this end.

switch (len & 3)
0: it looks ok to me.
1: the return int(*((byte *)x)) - int(*((byte *)y)); should be returned?
2: looks ok
3: looks ok

I would maybe try masking out unused bytes, but that would lead to read out of buffer boundary. Is it safe?
I mean something like this

   ...
   const static dword masks[4] = { 0x00000000, 0x000000FF, 0x0000FFFF, 0x00FFFFFF };  //Intel-like endian only!
   return int(_byteswap_ulong(*x & masks[len&3]) - _byteswap_ulong(*y & masks[len&3]));

I'm not sure I got the byteswap purpose correctly, but I think I got, so my code is probably ok (but I didn't test it).

Of course it reads beyond buffer end, so you need to know it will not raise exception or crash the application on target platform.

[Updated on: Fri, 22 February 2008 19:11]

Report message to a moderator

Re: Optimized memcmp for x86 [message #14337 is a reply to message #14308]

Fri, 22 February 2008 19:52

mr_ped
Messages: 825
Registered: November 2005
Location: Czech Republic - Praha

Experienced Contributor

oh, and a very very minor optimization (which is probably done by compiler itself anyway?)

The line:
if(((size_t)a & 3) | ((size_t)b & 3))
can be:
if((((size_t)a) | ((size_t)b)) & 3)
(one bit-and less)

Report message to a moderator

Re: Optimized memcmp for x86 [message #14357 is a reply to message #14335]

Sat, 23 February 2008 16:15

mirek
Messages: 13975
Registered: November 2005

Ultimate Member

mr_ped wrote on Fri, 22 February 2008 13:08

luzr wrote on Fri, 22 February 2008 11:07

	if(len & 2)
		if(*(word *)x != *(word *)y)
			return int(_byteswap_ushort(*(word *)x) - _byteswap_ushort(*(word *)y));
	if(len & 1)
		return int(*((byte *)x + 2)) - int(*((byte *)y + 2));

   ...
   const static dword masks[4] = { 0x00000000, 0x000000FF, 0x0000FFFF, 0x00FFFFFF };  //Intel-like endian only!
   return int(_byteswap_ulong(*x & masks[len&3]) - _byteswap_ulong(*y & masks[len&3]));

Yes, you are right, len&1 was wrong... (now is not it nice to have the code checked by posting here? Smile

And the idea of mask is interesting, but poses interesting problem as well:

We are not guaranteed by "len" parameter that we can read the whole dword. At the same time, "external" (like memory model and allocator) seem to guarantee that (because we have started on aligned...). I will have to think hard about this Wink

Mirek

Report message to a moderator

Re: Optimized memcmp for x86 [message #14360 is a reply to message #14308]

Sat, 23 February 2008 18:56

mr_ped
Messages: 825
Registered: November 2005
Location: Czech Republic - Praha

Experienced Contributor

Yes, that's the problem.
In good old days of ZX Spectrum I was sure I can read beyond "len" and nothing will happen (except getting weird data).

But nowadays I don't know so much about different platforms and OS to be sure you can read aligned double word without causing some exception or crash of application.

I think x86 works usually with aligned memory allocation, so you basically can NOT allocate like 13 bytes only, but the "THINK" word is the crucial part of this sentence.

The C++ itself does not do any memory read checking, so it's up to OS.
So I think the testing application which will allocate memory by OS mem allocator directly would give us the reliable answer.

Than again if such OS allocators allow to allocate only for example 4kB chunks and not 13 bytes, I think it will never raise exception or crash and you may safely read beyond end of buffer.

[Updated on: Sat, 23 February 2008 18:56]

Report message to a moderator

Re: Optimized memcmp for x86 [message #14363 is a reply to message #14360]

Sat, 23 February 2008 22:18

cbpporter
Messages: 1401
Registered: September 2007

Ultimate Contributor

Quote:

Than again if such OS allocators allow to allocate only for example 4kB chunks and not 13 bytes, I think it will never raise exception or crash and you may safely read beyond end of buffer.

That is true, but it will not allocate those 4KB for every 13 bytes you want, only if the previoslly allocated 4KB chunk is full. Your requested pointer may be on the end of that allocated zone, and here you could have big problems.

Anyway, this is a memcmp operation, so even if you don't crash, just getting gibberish data could compromise the functionality.

Report message to a moderator

Re: Optimized memcmp for x86 [message #14366 is a reply to message #14363]

Sat, 23 February 2008 23:23

mr_ped
Messages: 825
Registered: November 2005
Location: Czech Republic - Praha

Experienced Contributor

cbpporter wrote on Sat, 23 February 2008 22:18

Quote:

Than again if such OS allocators allow to allocate only for example 4kB chunks and not 13 bytes, I think it will never raise exception or crash and you may safely read beyond end of buffer.

If you get truly 13B from end of 4kB chunk, the starting address will be not aligned => classic memcmp will be called.

If starting pointer is aligned and you know the whole 4kB chunk is readable, you may safely read 4bytes even if the last 3 are beyond the original buffer, you can't cross 4kB chunk boundary in any case.

Those gibberish data are masked out before comparison. Smile

You should probably check the original routine and my suggestion firstly to get idea what's the problem with that last double word read from memory.

Report message to a moderator

Re: Optimized memcmp for x86 [message #14457 is a reply to message #14366]

Tue, 26 February 2008 22:07

mirek
Messages: 13975
Registered: November 2005

Ultimate Member

mr_ped wrote on Sat, 23 February 2008 17:23

You should probably check the original routine and my suggestion firstly to get idea what's the problem with that last double word read from memory.

Yes. I believe you are correct.

(This is post-2008.1 stuff anyway, but it is nice to get ready Smile

Mirek

Report message to a moderator

Re: Optimized memcmp for x86 [message #14489 is a reply to message #14308]

Wed, 27 February 2008 11:43

mr_ped
Messages: 825
Registered: November 2005
Location: Czech Republic - Praha

Experienced Contributor

I was thinking about it a bit more, and I think some serious profiling data should be gathered, i.e. how often it is called with aligned pointers.

I think this function itself is rarely used?
And if it is, it may often be used to search trough strings?
And in such case the pointers will be not aligned very often?

I mean, is this really worth of effort? Only some profiling of real applications can tell.

But it was nice mental exercise anyway. Smile

Report message to a moderator

Re: Optimized memcmp for x86 [message #14756 is a reply to message #14489]

Wed, 12 March 2008 21:27

mirek
Messages: 13975
Registered: November 2005

Ultimate Member

mr_ped wrote on Wed, 27 February 2008 05:43

Well, the primary motivation was the speedup for small strings - there data definitely are aligned and even the number of characters is fixed.

This optimization seems to bring in about 5% improvement to container benchmark where Sort is used. That is not bad.

Anyway, for non-small strings, we do have guarantee that data are always aligned, so I will probably directly use even unaligned version Smile

For String alone, this optimization is worthwhile.

Mirek

Report message to a moderator

Previous Topic:	multiple classes include-problem
Next Topic:	What does , means?

Goto Forum:

-=] Back to Top [=-

[ Syndicate this forum (XML) ] [

] [

]

Current Time: Tue Apr 16 09:57:19 CEST 2024

Total time taken to generate the page: 0.55617 seconds