Overview
Examples
Screenshots
Comparisons
Applications
Download
Documentation
Tutorials
Bazaar
Status & Roadmap
FAQ
Authors & License
Forums
Funding Ultimate++
Search on this site
Search in forums












SourceForge.net Logo
Home » Community » U++ community news and announcements » String::Cat optimization
String::Cat optimization [message #34575] Wed, 30 November 2011 17:58 Go to next message
mirek is currently offline  mirek
Messages: 13975
Registered: November 2005
Ultimate Member
By replacing memcpy with

inline void svo_memcpy(char *t, const char *s, int len)
{
	switch(len) {
	case 15: t[14] = s[14];
	case 14: t[13] = s[13];
	case 13: t[12] = s[12];
	case 12: t[11] = s[11];
	case 11: t[10] = s[10];
	case 10: t[9] = s[9];
	case  9: t[8] = s[8];
	case  8: t[7] = s[7];
	case  7: t[6] = s[6];
	case  6: t[5] = s[5];
	case  5: t[4] = s[4];
	case  4: t[3] = s[3];
	case  3: t[2] = s[2];
	case  2: t[1] = s[1];
	case  1: t[0] = s[0];
		return;
	}
	memcpy(t, s, len);
}


I have recieved about 10% improvemenced in the String::Cat for small values.
Re: String::Cat optimization [message #34576 is a reply to message #34575] Wed, 30 November 2011 18:11 Go to previous messageGo to next message
koldo is currently offline  koldo
Messages: 3357
Registered: August 2008
Senior Veteran
Amazing Very Happy Very Happy . Is it the same in different environments?

Best regards
Iñaki
Re: String::Cat optimization [message #34577 is a reply to message #34576] Wed, 30 November 2011 18:17 Go to previous messageGo to next message
dolik.rce is currently offline  dolik.rce
Messages: 1789
Registered: August 2008
Location: Czech Republic
Ultimate Contributor

Huh... and I always thought memcpy is the fastest way to copy things Shocked

Any ideas how does the compiler optimization magic works in this case? I'd like to understand, it might be useful in other situations as well.

Honza
Re: String::Cat optimization [message #34583 is a reply to message #34577] Wed, 30 November 2011 21:15 Go to previous messageGo to next message
mirek is currently offline  mirek
Messages: 13975
Registered: November 2005
Ultimate Member
dolik.rce wrote on Wed, 30 November 2011 12:17

Huh... and I always thought memcpy is the fastest way to copy things Shocked

Any ideas how does the compiler optimization magic works in this case? I'd like to understand, it might be useful in other situations as well.

Honza


Seriously, I am really ambiguos about this optimization, it is really border case. Plus I have only tested with MSC.

Anyway, looking at assembly code, memcpy is really optimized pretty well, but spends a lot of time detecting heavy-lifting scenario (like target and source both aligned etc...), whereas String is all about adding small pieces of data.

The switch leads to simple jump to 'multiplied' position and then 'linear' code up to end. It is a very little bit faster..

All in all, perhaps more data are needed. It should be easy to #ifdef svo_memcpy to regular memcpy...

My benchmarking code was something like this:

	String str;
	for(int i = 0; i < 10000000; i++) {
		str.Clear();
		RTIMING("Cat 18");
		str.Cat("Hello", 5);
		str.Cat("Hello", 5);
		str.Cat("Hello", 5);
	}
	for(int i = 0; i < 10000000; i++) {
		str.Clear();
		RTIMING("Cat 40");
		str.Cat("Hello", 5);
		str.Cat("Hello", 5);
		str.Cat("Hello", 5);
		str.Cat("Hello", 5);
		str.Cat("Hello", 5);
		str.Cat("Hello", 5);
		str.Cat("Hello", 5);
		str.Cat("Hello", 5);
	}


before optimization

TIMING Cat 40         :  1.98 s  - 198.46 ns ( 2.17 s  / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000
TIMING Cat 18         : 591.60 ms - 59.16 ns (772.00 ms / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000


after

TIMING Cat 40         :  1.48 s  - 148.37 ns ( 1.68 s  / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000
TIMING Cat 18         : 482.71 ms - 48.27 ns (676.00 ms / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000

[Updated on: Wed, 30 November 2011 21:16]

Report message to a moderator

Re: String::Cat optimization [message #34585 is a reply to message #34583] Thu, 01 December 2011 06:18 Go to previous messageGo to next message
koldo is currently offline  koldo
Messages: 3357
Registered: August 2008
Senior Veteran
More cases

There is a slight improvement in MSC but a big one in MinGW.

MSC10 Speed
- Standard
TIMING Cat 40         :  2.34 s  - 233.71 ns ( 2.56 s  / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000
TIMING Cat 18         : 839.12 ms - 83.91 ns ( 1.06 s  / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000
- Experimental
TIMING Cat 40         :  2.24 s  - 223.96 ns ( 2.44 s  / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000
TIMING Cat 18         : 788.63 ms - 78.86 ns (987.00 ms / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000

MSC10 Optimal
- Standard
TIMING Cat 40         :  2.37 s  - 237.30 ns ( 2.57 s  / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000
TIMING Cat 18         : 857.95 ms - 85.80 ns ( 1.05 s  / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000
- Experimental
TIMING Cat 40         :  2.27 s  - 226.80 ns ( 2.48 s  / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000
TIMING Cat 18         : 893.96 ms - 89.40 ns ( 1.10 s  / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000

MSC9 Speed
- Standard
TIMING Cat 40         :  2.38 s  - 238.20 ns ( 2.61 s  / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000
TIMING Cat 18         : 884.96 ms - 88.50 ns ( 1.12 s  / 10000000 ), min:  0.00 ns, max:  5.00 ms, nesting: 1 - 10000000
- Experimental
TIMING Cat 40         :  2.13 s  - 212.92 ns ( 2.34 s  / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000
TIMING Cat 18         : 866.17 ms - 86.62 ns ( 1.07 s  / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000

MSC9 Optimal
- Standard
TIMING Cat 40         :  3.04 s  - 304.05 ns ( 3.24 s  / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000
TIMING Cat 18         :  1.04 s  - 103.95 ns ( 1.24 s  / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000
- Experimental
TIMING Cat 40         :  2.48 s  - 248.49 ns ( 2.67 s  / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000
TIMING Cat 18         : 915.92 ms - 91.59 ns ( 1.10 s  / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000

MINGW 4.5.2 Speed
- Standard
TIMING Cat 40         :  5.59 s  - 558.74 ns ( 5.85 s  / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000
TIMING Cat 18         :  1.89 s  - 189.04 ns ( 2.15 s  / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000
- Experimental
TIMING Cat 40         :  2.91 s  - 290.74 ns ( 3.18 s  / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000
TIMING Cat 18         : 853.43 ms - 85.34 ns ( 1.13 s  / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000

MINGW 4.5.2 Optimal
- Standard
TIMING Cat 40         :  5.54 s  - 554.21 ns ( 5.84 s  / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000
TIMING Cat 18         :  1.81 s  - 180.81 ns ( 2.11 s  / 10000000 ), min:  0.00 ns, max:  2.00 ms, nesting: 1 - 10000000

- Experimental
TIMING Cat 40         :  2.84 s  - 284.44 ns ( 3.14 s  / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000
TIMING Cat 18         : 824.40 ms - 82.44 ns ( 1.12 s  / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000



Best regards
Iñaki
Re: String::Cat optimization [message #34586 is a reply to message #34585] Thu, 01 December 2011 08:04 Go to previous messageGo to next message
mirek is currently offline  mirek
Messages: 13975
Registered: November 2005
Ultimate Member
koldo wrote on Thu, 01 December 2011 00:18

More cases

There is a slight improvement in MSC but a big one in MinGW.

MSC10 Speed
- Standard
TIMING Cat 40         :  2.34 s  - 233.71 ns ( 2.56 s  / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000
TIMING Cat 18         : 839.12 ms - 83.91 ns ( 1.06 s  / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000
- Experimental
TIMING Cat 40         :  2.24 s  - 223.96 ns ( 2.44 s  / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000
TIMING Cat 18         : 788.63 ms - 78.86 ns (987.00 ms / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000

MSC10 Optimal
- Standard
TIMING Cat 40         :  2.37 s  - 237.30 ns ( 2.57 s  / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000
TIMING Cat 18         : 857.95 ms - 85.80 ns ( 1.05 s  / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000
- Experimental
TIMING Cat 40         :  2.27 s  - 226.80 ns ( 2.48 s  / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000
TIMING Cat 18         : 893.96 ms - 89.40 ns ( 1.10 s  / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000

MSC9 Speed
- Standard
TIMING Cat 40         :  2.38 s  - 238.20 ns ( 2.61 s  / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000
TIMING Cat 18         : 884.96 ms - 88.50 ns ( 1.12 s  / 10000000 ), min:  0.00 ns, max:  5.00 ms, nesting: 1 - 10000000
- Experimental
TIMING Cat 40         :  2.13 s  - 212.92 ns ( 2.34 s  / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000
TIMING Cat 18         : 866.17 ms - 86.62 ns ( 1.07 s  / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000

MSC9 Optimal
- Standard
TIMING Cat 40         :  3.04 s  - 304.05 ns ( 3.24 s  / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000
TIMING Cat 18         :  1.04 s  - 103.95 ns ( 1.24 s  / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000
- Experimental
TIMING Cat 40         :  2.48 s  - 248.49 ns ( 2.67 s  / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000
TIMING Cat 18         : 915.92 ms - 91.59 ns ( 1.10 s  / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000

MINGW 4.5.2 Speed
- Standard
TIMING Cat 40         :  5.59 s  - 558.74 ns ( 5.85 s  / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000
TIMING Cat 18         :  1.89 s  - 189.04 ns ( 2.15 s  / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000
- Experimental
TIMING Cat 40         :  2.91 s  - 290.74 ns ( 3.18 s  / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000
TIMING Cat 18         : 853.43 ms - 85.34 ns ( 1.13 s  / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000

MINGW 4.5.2 Optimal
- Standard
TIMING Cat 40         :  5.54 s  - 554.21 ns ( 5.84 s  / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000
TIMING Cat 18         :  1.81 s  - 180.81 ns ( 2.11 s  / 10000000 ), min:  0.00 ns, max:  2.00 ms, nesting: 1 - 10000000

- Experimental
TIMING Cat 40         :  2.84 s  - 284.44 ns ( 3.14 s  / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000
TIMING Cat 18         : 824.40 ms - 82.44 ns ( 1.12 s  / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000




Well, I guess this sort of vindicates the optimization...

Mirek
Re: String::Cat optimization [message #34588 is a reply to message #34586] Thu, 01 December 2011 08:39 Go to previous messageGo to next message
koldo is currently offline  koldo
Messages: 3357
Registered: August 2008
Senior Veteran
Please could somebody try it in Linux?

Best regards
Iñaki
Re: String::Cat optimization [message #34590 is a reply to message #34588] Thu, 01 December 2011 10:21 Go to previous messageGo to next message
dolik.rce is currently offline  dolik.rce
Messages: 1789
Registered: August 2008
Location: Czech Republic
Ultimate Contributor

koldo wrote on Thu, 01 December 2011 08:39

Please could somebody try it in Linux?

Here it is:
GCC-4.6.2 Optimal with svo_memcpy
TIMING Cat 40         :  2.65 s  - 265.34 ns (11.06 s  / 10000000 ), min:  0.00 ns, max:  5.00 ms, nesting: 1 - 10000000
TIMING Cat 18         :  0.00 ns -  0.00 ns ( 7.88 s  / 10000000 ), min:  0.00 ns, max:  3.00 ms, nesting: 1 - 10000000

GCC-4.6.2 Optimal with memcpy
TIMING Cat 40         :  2.94 s  - 293.57 ns (11.13 s  / 10000000 ), min:  0.00 ns, max:  4.00 ms, nesting: 1 - 10000000
TIMING Cat 18         :  0.00 ns -  0.00 ns ( 8.11 s  / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000


GCC-4.6.2 Speed with svo_memcpy
TIMING Cat 40         : 246.75 ms - 24.68 ns (11.14 s  / 10000000 ), min:  0.00 ns, max:  4.00 ms, nesting: 1 - 10000000
TIMING Cat 18         :  0.00 ns -  0.00 ns ( 7.93 s  / 10000000 ), min:  0.00 ns, max:  4.00 ms, nesting: 1 - 10000000

GCC-4.6.2 Speed with memcpy
TIMING Cat 40         :  2.79 s  - 279.30 ns (11.45 s  / 10000000 ), min:  0.00 ns, max:  6.00 ms, nesting: 1 - 10000000
TIMING Cat 18         :  2.03 s  - 202.80 ns (10.68 s  / 10000000 ), min:  0.00 ns, max:  4.00 ms, nesting: 1 - 10000000



CLANG-2.9 Optimal with svo_memcpy
TIMING Cat 40         :  1.65 s  - 165.35 ns (11.07 s  / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000
TIMING Cat 18         :  0.00 ns -  0.00 ns ( 8.09 s  / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000

CLANG-2.9 Optimal with memcpy
TIMING Cat 40         :  4.20 s  - 420.01 ns (11.33 s  / 10000000 ), min:  0.00 ns, max:  2.00 ms, nesting: 1 - 10000000
TIMING Cat 18         :  1.20 s  - 119.81 ns ( 8.32 s  / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000


CLANG-2.9 Speed with svo_memcpy
TIMING Cat 40         :  4.28 s  - 428.33 ns (11.21 s  / 10000000 ), min:  0.00 ns, max:  2.00 ms, nesting: 1 - 10000000
TIMING Cat 18         : 797.30 ms - 79.73 ns ( 7.72 s  / 10000000 ), min:  0.00 ns, max:  5.00 ms, nesting: 1 - 10000000

CLANG-2.9 Speed with memcpy
TIMING Cat 40         :  5.87 ms -  0.59 ns (11.09 s  / 10000000 ), min:  0.00 ns, max:  2.00 ms, nesting: 1 - 10000000
TIMING Cat 18         :  0.00 ns -  0.00 ns ( 8.36 s  / 10000000 ), min:  0.00 ns, max:  5.00 ms, nesting: 1 - 10000000


Honza
Re: String::Cat optimization [message #34591 is a reply to message #34590] Thu, 01 December 2011 11:37 Go to previous messageGo to next message
mirek is currently offline  mirek
Messages: 13975
Registered: November 2005
Ultimate Member
dolik.rce wrote on Thu, 01 December 2011 04:21

koldo wrote on Thu, 01 December 2011 08:39

Please could somebody try it in Linux?

Here it is:
GCC-4.6.2 Optimal with svo_memcpy
TIMING Cat 40         :  2.65 s  - 265.34 ns (11.06 s  / 10000000 ), min:  0.00 ns, max:  5.00 ms, nesting: 1 - 10000000
TIMING Cat 18         :  0.00 ns -  0.00 ns ( 7.88 s  / 10000000 ), min:  0.00 ns, max:  3.00 ms, nesting: 1 - 10000000

GCC-4.6.2 Optimal with memcpy
TIMING Cat 40         :  2.94 s  - 293.57 ns (11.13 s  / 10000000 ), min:  0.00 ns, max:  4.00 ms, nesting: 1 - 10000000
TIMING Cat 18         :  0.00 ns -  0.00 ns ( 8.11 s  / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000


GCC-4.6.2 Speed with svo_memcpy
TIMING Cat 40         : 246.75 ms - 24.68 ns (11.14 s  / 10000000 ), min:  0.00 ns, max:  4.00 ms, nesting: 1 - 10000000
TIMING Cat 18         :  0.00 ns -  0.00 ns ( 7.93 s  / 10000000 ), min:  0.00 ns, max:  4.00 ms, nesting: 1 - 10000000

GCC-4.6.2 Speed with memcpy
TIMING Cat 40         :  2.79 s  - 279.30 ns (11.45 s  / 10000000 ), min:  0.00 ns, max:  6.00 ms, nesting: 1 - 10000000
TIMING Cat 18         :  2.03 s  - 202.80 ns (10.68 s  / 10000000 ), min:  0.00 ns, max:  4.00 ms, nesting: 1 - 10000000



CLANG-2.9 Optimal with svo_memcpy
TIMING Cat 40         :  1.65 s  - 165.35 ns (11.07 s  / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000
TIMING Cat 18         :  0.00 ns -  0.00 ns ( 8.09 s  / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000

CLANG-2.9 Optimal with memcpy
TIMING Cat 40         :  4.20 s  - 420.01 ns (11.33 s  / 10000000 ), min:  0.00 ns, max:  2.00 ms, nesting: 1 - 10000000
TIMING Cat 18         :  1.20 s  - 119.81 ns ( 8.32 s  / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000


CLANG-2.9 Speed with svo_memcpy
TIMING Cat 40         :  4.28 s  - 428.33 ns (11.21 s  / 10000000 ), min:  0.00 ns, max:  2.00 ms, nesting: 1 - 10000000
TIMING Cat 18         : 797.30 ms - 79.73 ns ( 7.72 s  / 10000000 ), min:  0.00 ns, max:  5.00 ms, nesting: 1 - 10000000

CLANG-2.9 Speed with memcpy
TIMING Cat 40         :  5.87 ms -  0.59 ns (11.09 s  / 10000000 ), min:  0.00 ns, max:  2.00 ms, nesting: 1 - 10000000
TIMING Cat 18         :  0.00 ns -  0.00 ns ( 8.36 s  / 10000000 ), min:  0.00 ns, max:  5.00 ms, nesting: 1 - 10000000


Honza


Looking at it, it seems like there is something fishy about times... maybe something with TIMING is now broken? Or perhaps it does bot behave well now in linux?
Re: String::Cat optimization [message #34592 is a reply to message #34591] Thu, 01 December 2011 12:16 Go to previous messageGo to next message
dolik.rce is currently offline  dolik.rce
Messages: 1789
Registered: August 2008
Location: Czech Republic
Ultimate Contributor

mirek wrote on Thu, 01 December 2011 11:37

Looking at it, it seems like there is something fishy about times... maybe something with TIMING is now broken? Or perhaps it does bot behave well now in linux?
I think there is something wrong with the TIMING macros on Linux. I get weird numbers from time to time, like 0.0ns calls above...

Honza
Re: String::Cat optimization [message #34594 is a reply to message #34591] Thu, 01 December 2011 14:38 Go to previous messageGo to next message
unodgs is currently offline  unodgs
Messages: 1366
Registered: November 2005
Location: Poland
Ultimate Contributor

Quote:


Looking at it, it seems like there is something fishy about times... maybe something with TIMING is now broken? Or perhaps it does bot behave well now in linux?


Not only in linux. In Windows GetTickCount is used which resolution is in range 10 - 16 ms only.
Re: String::Cat optimization [message #34595 is a reply to message #34592] Thu, 01 December 2011 14:50 Go to previous messageGo to next message
Tom1
Messages: 1212
Registered: March 2007
Senior Contributor
Hi,

I played around with Mirek's idea awhile and according to my simple '::GetTickCount()' benchmarking on MSC9/Win7x64 I managed to squeeze yet more performance out of it. The test covered all transfer lengths from 1 to 16 bytes.

The svo_memcpy() suffers a performance penalty at len==16, where secondary function call to memcpy steps in. The following macro approach helps dramatically to reduce that penalty. I also discovered that the memcpy() performance might not be reached systematically at transfer lengths above 11 bytes, so limiting the switch to <= 11 bytes should improve overall performance.

	inline void memcpy11i(char *t, const char *s, int len){
		switch(len) {
			case 11: t[10] = s[10];
			case 10: t[9] = s[9];
			case  9: t[8] = s[8];
			case  8: t[7] = s[7];
			case  7: t[6] = s[6];
			case  6: t[5] = s[5];
			case  5: t[4] = s[4];
			case  4: t[3] = s[3];
			case  3: t[2] = s[2];
			case  2: t[1] = s[1];
			case  1: t[0] = s[0];
		}
	}

#define	memcpy11(t, s, len) (len)>11 ? memcpy(t, s, len) : memcpy11i(t, s, len)



How does this perform on your systems?

Best regards,

Tom
Re: String::Cat optimization [message #34596 is a reply to message #34594] Thu, 01 December 2011 14:54 Go to previous messageGo to next message
Tom1
Messages: 1212
Registered: March 2007
Senior Contributor
Uno,

It is true that GetTickCount() runs by default on a 10/16 ms resolution only. However, that does not affect the results too much if your measurement period is on the order of one second or more. (I used 100000000 repetitions to get around this resolution issue.)

Best regards,

Tom
Re: String::Cat optimization [message #34603 is a reply to message #34594] Thu, 01 December 2011 17:51 Go to previous messageGo to next message
mirek is currently offline  mirek
Messages: 13975
Registered: November 2005
Ultimate Member
unodgs wrote on Thu, 01 December 2011 08:38

Quote:


Looking at it, it seems like there is something fishy about times... maybe something with TIMING is now broken? Or perhaps it does bot behave well now in linux?


Not only in linux. In Windows GetTickCount is used which resolution is in range 10 - 16 ms only.


That is actually OK, as long as the number of passes is big enough... (it statistically averages out, so with 10ms resolution you can successfully measure ns times - funny, is not it? Smile
Re: String::Cat optimization [message #34604 is a reply to message #34595] Thu, 01 December 2011 17:52 Go to previous messageGo to next message
mirek is currently offline  mirek
Messages: 13975
Registered: November 2005
Ultimate Member
Tom1 wrote on Thu, 01 December 2011 08:50

Hi,

I played around with Mirek's idea awhile and according to my simple '::GetTickCount()' benchmarking on MSC9/Win7x64 I managed to squeeze yet more performance out of it. The test covered all transfer lengths from 1 to 16 bytes.

The svo_memcpy() suffers a performance penalty at len==16, where secondary function call to memcpy steps in. The following macro approach helps dramatically to reduce that penalty. I also discovered that the memcpy() performance might not be reached systematically at transfer lengths above 11 bytes, so limiting the switch to <= 11 bytes should improve overall performance.

	inline void memcpy11i(char *t, const char *s, int len){
		switch(len) {
			case 11: t[10] = s[10];
			case 10: t[9] = s[9];
			case  9: t[8] = s[8];
			case  8: t[7] = s[7];
			case  7: t[6] = s[6];
			case  6: t[5] = s[5];
			case  5: t[4] = s[4];
			case  4: t[3] = s[3];
			case  3: t[2] = s[2];
			case  2: t[1] = s[1];
			case  1: t[0] = s[0];
		}
	}

#define	memcpy11(t, s, len) (len)>11 ? memcpy(t, s, len) : memcpy11i(t, s, len)



How does this perform on your systems?

Best regards,

Tom



Well, that is actually even better, as MSC refuses to inline that function....

Mirek
Re: String::Cat optimization [message #34608 is a reply to message #34595] Thu, 01 December 2011 20:41 Go to previous messageGo to next message
mirek is currently offline  mirek
Messages: 13975
Registered: November 2005
Ultimate Member
Tom1 wrote on Thu, 01 December 2011 08:50

Hi,

I played around with Mirek's idea awhile and according to my simple '::GetTickCount()' benchmarking on MSC9/Win7x64 I managed to squeeze yet more performance out of it. The test covered all transfer lengths from 1 to 16 bytes.

The svo_memcpy() suffers a performance penalty at len==16, where secondary function call to memcpy steps in. The following macro approach helps dramatically to reduce that penalty. I also discovered that the memcpy() performance might not be reached systematically at transfer lengths above 11 bytes, so limiting the switch to <= 11 bytes should improve overall performance.

	inline void memcpy11i(char *t, const char *s, int len){
		switch(len) {
			case 11: t[10] = s[10];
			case 10: t[9] = s[9];
			case  9: t[8] = s[8];
			case  8: t[7] = s[7];
			case  7: t[6] = s[6];
			case  6: t[5] = s[5];
			case  5: t[4] = s[4];
			case  4: t[3] = s[3];
			case  3: t[2] = s[2];
			case  2: t[1] = s[1];
			case  1: t[0] = s[0];
		}
	}

#define	memcpy11(t, s, len) (len)>11 ? memcpy(t, s, len) : memcpy11i(t, s, len)



How does this perform on your systems?

Best regards,

Tom



Thanks, this is even better. Somehow I forgot that if compiler decides not to inline something, I can still force it by macro.

So, I am following your advice, using macro and limit:

#define SVO_MEMCPY(tgt, src, len) \
do { \
	const char *s__ = (const char *)(src); \
	char *t__ = (char *)(tgt); \
	switch(len) { \
	case 11: t__[10] = s__[10]; \
	case 10: t__[9] = s__[9]; \
	case  9: t__[8] = s__[8]; \
	case  8: t__[7] = s__[7]; \
	case  7: t__[6] = s__[6]; \
	case  6: t__[5] = s__[5]; \
	case  5: t__[4] = s__[4]; \
	case  4: t__[3] = s__[3]; \
	case  3: t__[2] = s__[2]; \
	case  2: t__[1] = s__[1]; \
	case  1: t__[0] = s__[0]; \
		break; \
	default: \
		memcpy(t__, s__, len); \
	} \
} while(false)


memcpy
TIMING Cat 40         :  1.98 s  - 198.34 ns ( 2.15 s  / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000
TIMING Cat 18         : 599.44 ms - 59.94 ns (767.00 ms / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000


inline svo_memcpy
TIMING Cat 40         :  1.45 s  - 145.38 ns ( 1.63 s  / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000
TIMING Cat 18         : 491.79 ms - 49.18 ns (671.00 ms / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000


macro
TIMING Cat 40         :  1.04 s  - 103.71 ns ( 1.22 s  / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000
TIMING Cat 18         : 302.09 ms - 30.21 ns (486.00 ms / 10000000 ), min:  0.00 ns, max:  1.00 ms, nesting: 1 - 10000000


Amazing Smile

It is interesting how year after year, we can still squeeze a bit from String.... Smile

Mirek

Re: String::Cat optimization [message #34612 is a reply to message #34608] Fri, 02 December 2011 03:59 Go to previous messageGo to next message
Lance is currently offline  Lance
Messages: 526
Registered: March 2007
Contributor
Impressive.


Depending on how likely 0 length memory is "copied", it may be desirable to add a branch to handle it.


	case  1: t__[0] = s__[0]; \
        case  0: \
		break; \


Tom's code takes care of 0 length string without memcpy, just as above.
Re: String::Cat optimization [message #34613 is a reply to message #34608] Fri, 02 December 2011 06:02 Go to previous messageGo to next message
Novo is currently offline  Novo
Messages: 1358
Registered: December 2006
Ultimate Contributor
mirek wrote on Thu, 01 December 2011 14:41


Thanks, this is even better. Somehow I forgot that if compiler decides not to inline something, I can still force it by macro.



There is an easier way to force inlining:

MSVC - __forceinline
GCC - __attribute__((always_inline))


Regards,
Novo
Re: String::Cat optimization [message #34614 is a reply to message #34612] Fri, 02 December 2011 06:39 Go to previous messageGo to next message
mirek is currently offline  mirek
Messages: 13975
Registered: November 2005
Ultimate Member
Lance wrote on Thu, 01 December 2011 21:59

Impressive.


Depending on how likely 0 length memory is "copied", it may be desirable to add a branch to handle it.


	case  1: t__[0] = s__[0]; \
        case  0: \
		break; \


Tom's code takes care of 0 length string without memcpy, just as above.


Yes. Thanks.

Mirek
Re: String::Cat optimization [message #34615 is a reply to message #34613] Fri, 02 December 2011 06:40 Go to previous messageGo to previous message
mirek is currently offline  mirek
Messages: 13975
Registered: November 2005
Ultimate Member
Novo wrote on Fri, 02 December 2011 00:02

mirek wrote on Thu, 01 December 2011 14:41


Thanks, this is even better. Somehow I forgot that if compiler decides not to inline something, I can still force it by macro.



There is an easier way to force inlining:

MSVC - __forceinline
GCC - __attribute__((always_inline))


Is it really easier? Smile

Mirek
Previous Topic: Please update sources
Next Topic: New GDB frontend for Theide
Goto Forum:
  


Current Time: Thu Apr 25 06:13:04 CEST 2024

Total time taken to generate the page: 0.03097 seconds