Home » Community » U++ community news and announcements » String::Cat optimization
|
|
Re: String::Cat optimization [message #34577 is a reply to message #34576] |
Wed, 30 November 2011 18:17 |
|
Huh... and I always thought memcpy is the fastest way to copy things
Any ideas how does the compiler optimization magic works in this case? I'd like to understand, it might be useful in other situations as well.
Honza
|
|
|
Re: String::Cat optimization [message #34583 is a reply to message #34577] |
Wed, 30 November 2011 21:15 |
|
mirek
Messages: 13976 Registered: November 2005
|
Ultimate Member |
|
|
dolik.rce wrote on Wed, 30 November 2011 12:17 | Huh... and I always thought memcpy is the fastest way to copy things
Any ideas how does the compiler optimization magic works in this case? I'd like to understand, it might be useful in other situations as well.
Honza
|
Seriously, I am really ambiguos about this optimization, it is really border case. Plus I have only tested with MSC.
Anyway, looking at assembly code, memcpy is really optimized pretty well, but spends a lot of time detecting heavy-lifting scenario (like target and source both aligned etc...), whereas String is all about adding small pieces of data.
The switch leads to simple jump to 'multiplied' position and then 'linear' code up to end. It is a very little bit faster..
All in all, perhaps more data are needed. It should be easy to #ifdef svo_memcpy to regular memcpy...
My benchmarking code was something like this:
String str;
for(int i = 0; i < 10000000; i++) {
str.Clear();
RTIMING("Cat 18");
str.Cat("Hello", 5);
str.Cat("Hello", 5);
str.Cat("Hello", 5);
}
for(int i = 0; i < 10000000; i++) {
str.Clear();
RTIMING("Cat 40");
str.Cat("Hello", 5);
str.Cat("Hello", 5);
str.Cat("Hello", 5);
str.Cat("Hello", 5);
str.Cat("Hello", 5);
str.Cat("Hello", 5);
str.Cat("Hello", 5);
str.Cat("Hello", 5);
}
before optimization
TIMING Cat 40 : 1.98 s - 198.46 ns ( 2.17 s / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
TIMING Cat 18 : 591.60 ms - 59.16 ns (772.00 ms / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
after
TIMING Cat 40 : 1.48 s - 148.37 ns ( 1.68 s / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
TIMING Cat 18 : 482.71 ms - 48.27 ns (676.00 ms / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
[Updated on: Wed, 30 November 2011 21:16] Report message to a moderator
|
|
|
Re: String::Cat optimization [message #34585 is a reply to message #34583] |
Thu, 01 December 2011 06:18 |
|
koldo
Messages: 3361 Registered: August 2008
|
Senior Veteran |
|
|
More cases
There is a slight improvement in MSC but a big one in MinGW.
MSC10 Speed
- Standard
TIMING Cat 40 : 2.34 s - 233.71 ns ( 2.56 s / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
TIMING Cat 18 : 839.12 ms - 83.91 ns ( 1.06 s / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
- Experimental
TIMING Cat 40 : 2.24 s - 223.96 ns ( 2.44 s / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
TIMING Cat 18 : 788.63 ms - 78.86 ns (987.00 ms / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
MSC10 Optimal
- Standard
TIMING Cat 40 : 2.37 s - 237.30 ns ( 2.57 s / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
TIMING Cat 18 : 857.95 ms - 85.80 ns ( 1.05 s / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
- Experimental
TIMING Cat 40 : 2.27 s - 226.80 ns ( 2.48 s / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
TIMING Cat 18 : 893.96 ms - 89.40 ns ( 1.10 s / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
MSC9 Speed
- Standard
TIMING Cat 40 : 2.38 s - 238.20 ns ( 2.61 s / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
TIMING Cat 18 : 884.96 ms - 88.50 ns ( 1.12 s / 10000000 ), min: 0.00 ns, max: 5.00 ms, nesting: 1 - 10000000
- Experimental
TIMING Cat 40 : 2.13 s - 212.92 ns ( 2.34 s / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
TIMING Cat 18 : 866.17 ms - 86.62 ns ( 1.07 s / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
MSC9 Optimal
- Standard
TIMING Cat 40 : 3.04 s - 304.05 ns ( 3.24 s / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
TIMING Cat 18 : 1.04 s - 103.95 ns ( 1.24 s / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
- Experimental
TIMING Cat 40 : 2.48 s - 248.49 ns ( 2.67 s / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
TIMING Cat 18 : 915.92 ms - 91.59 ns ( 1.10 s / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
MINGW 4.5.2 Speed
- Standard
TIMING Cat 40 : 5.59 s - 558.74 ns ( 5.85 s / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
TIMING Cat 18 : 1.89 s - 189.04 ns ( 2.15 s / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
- Experimental
TIMING Cat 40 : 2.91 s - 290.74 ns ( 3.18 s / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
TIMING Cat 18 : 853.43 ms - 85.34 ns ( 1.13 s / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
MINGW 4.5.2 Optimal
- Standard
TIMING Cat 40 : 5.54 s - 554.21 ns ( 5.84 s / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
TIMING Cat 18 : 1.81 s - 180.81 ns ( 2.11 s / 10000000 ), min: 0.00 ns, max: 2.00 ms, nesting: 1 - 10000000
- Experimental
TIMING Cat 40 : 2.84 s - 284.44 ns ( 3.14 s / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
TIMING Cat 18 : 824.40 ms - 82.44 ns ( 1.12 s / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
Best regards
Iñaki
|
|
|
Re: String::Cat optimization [message #34586 is a reply to message #34585] |
Thu, 01 December 2011 08:04 |
|
mirek
Messages: 13976 Registered: November 2005
|
Ultimate Member |
|
|
koldo wrote on Thu, 01 December 2011 00:18 | More cases
There is a slight improvement in MSC but a big one in MinGW.
MSC10 Speed
- Standard
TIMING Cat 40 : 2.34 s - 233.71 ns ( 2.56 s / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
TIMING Cat 18 : 839.12 ms - 83.91 ns ( 1.06 s / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
- Experimental
TIMING Cat 40 : 2.24 s - 223.96 ns ( 2.44 s / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
TIMING Cat 18 : 788.63 ms - 78.86 ns (987.00 ms / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
MSC10 Optimal
- Standard
TIMING Cat 40 : 2.37 s - 237.30 ns ( 2.57 s / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
TIMING Cat 18 : 857.95 ms - 85.80 ns ( 1.05 s / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
- Experimental
TIMING Cat 40 : 2.27 s - 226.80 ns ( 2.48 s / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
TIMING Cat 18 : 893.96 ms - 89.40 ns ( 1.10 s / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
MSC9 Speed
- Standard
TIMING Cat 40 : 2.38 s - 238.20 ns ( 2.61 s / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
TIMING Cat 18 : 884.96 ms - 88.50 ns ( 1.12 s / 10000000 ), min: 0.00 ns, max: 5.00 ms, nesting: 1 - 10000000
- Experimental
TIMING Cat 40 : 2.13 s - 212.92 ns ( 2.34 s / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
TIMING Cat 18 : 866.17 ms - 86.62 ns ( 1.07 s / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
MSC9 Optimal
- Standard
TIMING Cat 40 : 3.04 s - 304.05 ns ( 3.24 s / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
TIMING Cat 18 : 1.04 s - 103.95 ns ( 1.24 s / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
- Experimental
TIMING Cat 40 : 2.48 s - 248.49 ns ( 2.67 s / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
TIMING Cat 18 : 915.92 ms - 91.59 ns ( 1.10 s / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
MINGW 4.5.2 Speed
- Standard
TIMING Cat 40 : 5.59 s - 558.74 ns ( 5.85 s / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
TIMING Cat 18 : 1.89 s - 189.04 ns ( 2.15 s / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
- Experimental
TIMING Cat 40 : 2.91 s - 290.74 ns ( 3.18 s / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
TIMING Cat 18 : 853.43 ms - 85.34 ns ( 1.13 s / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
MINGW 4.5.2 Optimal
- Standard
TIMING Cat 40 : 5.54 s - 554.21 ns ( 5.84 s / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
TIMING Cat 18 : 1.81 s - 180.81 ns ( 2.11 s / 10000000 ), min: 0.00 ns, max: 2.00 ms, nesting: 1 - 10000000
- Experimental
TIMING Cat 40 : 2.84 s - 284.44 ns ( 3.14 s / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
TIMING Cat 18 : 824.40 ms - 82.44 ns ( 1.12 s / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
|
Well, I guess this sort of vindicates the optimization...
Mirek
|
|
|
|
Re: String::Cat optimization [message #34590 is a reply to message #34588] |
Thu, 01 December 2011 10:21 |
|
koldo wrote on Thu, 01 December 2011 08:39 | Please could somebody try it in Linux?
|
Here it is:GCC-4.6.2 Optimal with svo_memcpy
TIMING Cat 40 : 2.65 s - 265.34 ns (11.06 s / 10000000 ), min: 0.00 ns, max: 5.00 ms, nesting: 1 - 10000000
TIMING Cat 18 : 0.00 ns - 0.00 ns ( 7.88 s / 10000000 ), min: 0.00 ns, max: 3.00 ms, nesting: 1 - 10000000
GCC-4.6.2 Optimal with memcpy
TIMING Cat 40 : 2.94 s - 293.57 ns (11.13 s / 10000000 ), min: 0.00 ns, max: 4.00 ms, nesting: 1 - 10000000
TIMING Cat 18 : 0.00 ns - 0.00 ns ( 8.11 s / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
GCC-4.6.2 Speed with svo_memcpy
TIMING Cat 40 : 246.75 ms - 24.68 ns (11.14 s / 10000000 ), min: 0.00 ns, max: 4.00 ms, nesting: 1 - 10000000
TIMING Cat 18 : 0.00 ns - 0.00 ns ( 7.93 s / 10000000 ), min: 0.00 ns, max: 4.00 ms, nesting: 1 - 10000000
GCC-4.6.2 Speed with memcpy
TIMING Cat 40 : 2.79 s - 279.30 ns (11.45 s / 10000000 ), min: 0.00 ns, max: 6.00 ms, nesting: 1 - 10000000
TIMING Cat 18 : 2.03 s - 202.80 ns (10.68 s / 10000000 ), min: 0.00 ns, max: 4.00 ms, nesting: 1 - 10000000
CLANG-2.9 Optimal with svo_memcpy
TIMING Cat 40 : 1.65 s - 165.35 ns (11.07 s / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
TIMING Cat 18 : 0.00 ns - 0.00 ns ( 8.09 s / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
CLANG-2.9 Optimal with memcpy
TIMING Cat 40 : 4.20 s - 420.01 ns (11.33 s / 10000000 ), min: 0.00 ns, max: 2.00 ms, nesting: 1 - 10000000
TIMING Cat 18 : 1.20 s - 119.81 ns ( 8.32 s / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
CLANG-2.9 Speed with svo_memcpy
TIMING Cat 40 : 4.28 s - 428.33 ns (11.21 s / 10000000 ), min: 0.00 ns, max: 2.00 ms, nesting: 1 - 10000000
TIMING Cat 18 : 797.30 ms - 79.73 ns ( 7.72 s / 10000000 ), min: 0.00 ns, max: 5.00 ms, nesting: 1 - 10000000
CLANG-2.9 Speed with memcpy
TIMING Cat 40 : 5.87 ms - 0.59 ns (11.09 s / 10000000 ), min: 0.00 ns, max: 2.00 ms, nesting: 1 - 10000000
TIMING Cat 18 : 0.00 ns - 0.00 ns ( 8.36 s / 10000000 ), min: 0.00 ns, max: 5.00 ms, nesting: 1 - 10000000
Honza
|
|
|
Re: String::Cat optimization [message #34591 is a reply to message #34590] |
Thu, 01 December 2011 11:37 |
|
mirek
Messages: 13976 Registered: November 2005
|
Ultimate Member |
|
|
dolik.rce wrote on Thu, 01 December 2011 04:21 |
koldo wrote on Thu, 01 December 2011 08:39 | Please could somebody try it in Linux?
|
Here it is:GCC-4.6.2 Optimal with svo_memcpy
TIMING Cat 40 : 2.65 s - 265.34 ns (11.06 s / 10000000 ), min: 0.00 ns, max: 5.00 ms, nesting: 1 - 10000000
TIMING Cat 18 : 0.00 ns - 0.00 ns ( 7.88 s / 10000000 ), min: 0.00 ns, max: 3.00 ms, nesting: 1 - 10000000
GCC-4.6.2 Optimal with memcpy
TIMING Cat 40 : 2.94 s - 293.57 ns (11.13 s / 10000000 ), min: 0.00 ns, max: 4.00 ms, nesting: 1 - 10000000
TIMING Cat 18 : 0.00 ns - 0.00 ns ( 8.11 s / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
GCC-4.6.2 Speed with svo_memcpy
TIMING Cat 40 : 246.75 ms - 24.68 ns (11.14 s / 10000000 ), min: 0.00 ns, max: 4.00 ms, nesting: 1 - 10000000
TIMING Cat 18 : 0.00 ns - 0.00 ns ( 7.93 s / 10000000 ), min: 0.00 ns, max: 4.00 ms, nesting: 1 - 10000000
GCC-4.6.2 Speed with memcpy
TIMING Cat 40 : 2.79 s - 279.30 ns (11.45 s / 10000000 ), min: 0.00 ns, max: 6.00 ms, nesting: 1 - 10000000
TIMING Cat 18 : 2.03 s - 202.80 ns (10.68 s / 10000000 ), min: 0.00 ns, max: 4.00 ms, nesting: 1 - 10000000
CLANG-2.9 Optimal with svo_memcpy
TIMING Cat 40 : 1.65 s - 165.35 ns (11.07 s / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
TIMING Cat 18 : 0.00 ns - 0.00 ns ( 8.09 s / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
CLANG-2.9 Optimal with memcpy
TIMING Cat 40 : 4.20 s - 420.01 ns (11.33 s / 10000000 ), min: 0.00 ns, max: 2.00 ms, nesting: 1 - 10000000
TIMING Cat 18 : 1.20 s - 119.81 ns ( 8.32 s / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
CLANG-2.9 Speed with svo_memcpy
TIMING Cat 40 : 4.28 s - 428.33 ns (11.21 s / 10000000 ), min: 0.00 ns, max: 2.00 ms, nesting: 1 - 10000000
TIMING Cat 18 : 797.30 ms - 79.73 ns ( 7.72 s / 10000000 ), min: 0.00 ns, max: 5.00 ms, nesting: 1 - 10000000
CLANG-2.9 Speed with memcpy
TIMING Cat 40 : 5.87 ms - 0.59 ns (11.09 s / 10000000 ), min: 0.00 ns, max: 2.00 ms, nesting: 1 - 10000000
TIMING Cat 18 : 0.00 ns - 0.00 ns ( 8.36 s / 10000000 ), min: 0.00 ns, max: 5.00 ms, nesting: 1 - 10000000
Honza
|
Looking at it, it seems like there is something fishy about times... maybe something with TIMING is now broken? Or perhaps it does bot behave well now in linux?
|
|
|
Re: String::Cat optimization [message #34592 is a reply to message #34591] |
Thu, 01 December 2011 12:16 |
|
mirek wrote on Thu, 01 December 2011 11:37 | Looking at it, it seems like there is something fishy about times... maybe something with TIMING is now broken? Or perhaps it does bot behave well now in linux?
| I think there is something wrong with the TIMING macros on Linux. I get weird numbers from time to time, like 0.0ns calls above...
Honza
|
|
|
Re: String::Cat optimization [message #34594 is a reply to message #34591] |
Thu, 01 December 2011 14:38 |
|
Quote: |
Looking at it, it seems like there is something fishy about times... maybe something with TIMING is now broken? Or perhaps it does bot behave well now in linux?
|
Not only in linux. In Windows GetTickCount is used which resolution is in range 10 - 16 ms only.
|
|
|
Re: String::Cat optimization [message #34595 is a reply to message #34592] |
Thu, 01 December 2011 14:50 |
Tom1
Messages: 1212 Registered: March 2007
|
Senior Contributor |
|
|
Hi,
I played around with Mirek's idea awhile and according to my simple '::GetTickCount()' benchmarking on MSC9/Win7x64 I managed to squeeze yet more performance out of it. The test covered all transfer lengths from 1 to 16 bytes.
The svo_memcpy() suffers a performance penalty at len==16, where secondary function call to memcpy steps in. The following macro approach helps dramatically to reduce that penalty. I also discovered that the memcpy() performance might not be reached systematically at transfer lengths above 11 bytes, so limiting the switch to <= 11 bytes should improve overall performance.
inline void memcpy11i(char *t, const char *s, int len){
switch(len) {
case 11: t[10] = s[10];
case 10: t[9] = s[9];
case 9: t[8] = s[8];
case 8: t[7] = s[7];
case 7: t[6] = s[6];
case 6: t[5] = s[5];
case 5: t[4] = s[4];
case 4: t[3] = s[3];
case 3: t[2] = s[2];
case 2: t[1] = s[1];
case 1: t[0] = s[0];
}
}
#define memcpy11(t, s, len) (len)>11 ? memcpy(t, s, len) : memcpy11i(t, s, len)
How does this perform on your systems?
Best regards,
Tom
|
|
|
Re: String::Cat optimization [message #34596 is a reply to message #34594] |
Thu, 01 December 2011 14:54 |
Tom1
Messages: 1212 Registered: March 2007
|
Senior Contributor |
|
|
Uno,
It is true that GetTickCount() runs by default on a 10/16 ms resolution only. However, that does not affect the results too much if your measurement period is on the order of one second or more. (I used 100000000 repetitions to get around this resolution issue.)
Best regards,
Tom
|
|
|
|
Re: String::Cat optimization [message #34604 is a reply to message #34595] |
Thu, 01 December 2011 17:52 |
|
mirek
Messages: 13976 Registered: November 2005
|
Ultimate Member |
|
|
Tom1 wrote on Thu, 01 December 2011 08:50 | Hi,
I played around with Mirek's idea awhile and according to my simple '::GetTickCount()' benchmarking on MSC9/Win7x64 I managed to squeeze yet more performance out of it. The test covered all transfer lengths from 1 to 16 bytes.
The svo_memcpy() suffers a performance penalty at len==16, where secondary function call to memcpy steps in. The following macro approach helps dramatically to reduce that penalty. I also discovered that the memcpy() performance might not be reached systematically at transfer lengths above 11 bytes, so limiting the switch to <= 11 bytes should improve overall performance.
inline void memcpy11i(char *t, const char *s, int len){
switch(len) {
case 11: t[10] = s[10];
case 10: t[9] = s[9];
case 9: t[8] = s[8];
case 8: t[7] = s[7];
case 7: t[6] = s[6];
case 6: t[5] = s[5];
case 5: t[4] = s[4];
case 4: t[3] = s[3];
case 3: t[2] = s[2];
case 2: t[1] = s[1];
case 1: t[0] = s[0];
}
}
#define memcpy11(t, s, len) (len)>11 ? memcpy(t, s, len) : memcpy11i(t, s, len)
How does this perform on your systems?
Best regards,
Tom
|
Well, that is actually even better, as MSC refuses to inline that function....
Mirek
|
|
|
Re: String::Cat optimization [message #34608 is a reply to message #34595] |
Thu, 01 December 2011 20:41 |
|
mirek
Messages: 13976 Registered: November 2005
|
Ultimate Member |
|
|
Tom1 wrote on Thu, 01 December 2011 08:50 | Hi,
I played around with Mirek's idea awhile and according to my simple '::GetTickCount()' benchmarking on MSC9/Win7x64 I managed to squeeze yet more performance out of it. The test covered all transfer lengths from 1 to 16 bytes.
The svo_memcpy() suffers a performance penalty at len==16, where secondary function call to memcpy steps in. The following macro approach helps dramatically to reduce that penalty. I also discovered that the memcpy() performance might not be reached systematically at transfer lengths above 11 bytes, so limiting the switch to <= 11 bytes should improve overall performance.
inline void memcpy11i(char *t, const char *s, int len){
switch(len) {
case 11: t[10] = s[10];
case 10: t[9] = s[9];
case 9: t[8] = s[8];
case 8: t[7] = s[7];
case 7: t[6] = s[6];
case 6: t[5] = s[5];
case 5: t[4] = s[4];
case 4: t[3] = s[3];
case 3: t[2] = s[2];
case 2: t[1] = s[1];
case 1: t[0] = s[0];
}
}
#define memcpy11(t, s, len) (len)>11 ? memcpy(t, s, len) : memcpy11i(t, s, len)
How does this perform on your systems?
Best regards,
Tom
|
Thanks, this is even better. Somehow I forgot that if compiler decides not to inline something, I can still force it by macro.
So, I am following your advice, using macro and limit:
#define SVO_MEMCPY(tgt, src, len) \
do { \
const char *s__ = (const char *)(src); \
char *t__ = (char *)(tgt); \
switch(len) { \
case 11: t__[10] = s__[10]; \
case 10: t__[9] = s__[9]; \
case 9: t__[8] = s__[8]; \
case 8: t__[7] = s__[7]; \
case 7: t__[6] = s__[6]; \
case 6: t__[5] = s__[5]; \
case 5: t__[4] = s__[4]; \
case 4: t__[3] = s__[3]; \
case 3: t__[2] = s__[2]; \
case 2: t__[1] = s__[1]; \
case 1: t__[0] = s__[0]; \
break; \
default: \
memcpy(t__, s__, len); \
} \
} while(false)
memcpy
TIMING Cat 40 : 1.98 s - 198.34 ns ( 2.15 s / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
TIMING Cat 18 : 599.44 ms - 59.94 ns (767.00 ms / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
inline svo_memcpy
TIMING Cat 40 : 1.45 s - 145.38 ns ( 1.63 s / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
TIMING Cat 18 : 491.79 ms - 49.18 ns (671.00 ms / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
macro
TIMING Cat 40 : 1.04 s - 103.71 ns ( 1.22 s / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
TIMING Cat 18 : 302.09 ms - 30.21 ns (486.00 ms / 10000000 ), min: 0.00 ns, max: 1.00 ms, nesting: 1 - 10000000
Amazing
It is interesting how year after year, we can still squeeze a bit from String....
Mirek
|
|
|
|
Re: String::Cat optimization [message #34613 is a reply to message #34608] |
Fri, 02 December 2011 06:02 |
Novo
Messages: 1358 Registered: December 2006
|
Ultimate Contributor |
|
|
mirek wrote on Thu, 01 December 2011 14:41 |
Thanks, this is even better. Somehow I forgot that if compiler decides not to inline something, I can still force it by macro.
|
There is an easier way to force inlining:
MSVC - __forceinline
GCC - __attribute__((always_inline))
Regards,
Novo
|
|
|
|
|
Goto Forum:
Current Time: Sat May 11 17:10:55 CEST 2024
Total time taken to generate the page: 0.02063 seconds
|