Overview
Examples
Screenshots
Comparisons
Applications
Download
Documentation
Tutorials
Bazaar
Status & Roadmap
FAQ
Authors & License
Forums
Funding Ultimate++
Search on this site
Search in forums












SourceForge.net Logo
Home » U++ Library support » U++ Core » Vector performance on a specific situation
Vector performance on a specific situation [message #40131] Tue, 18 June 2013 09:01 Go to next message
crydev is currently offline  crydev
Messages: 151
Registered: October 2012
Location: Netherlands
Experienced Member
Hello,

I have a question about the Vector's performance in a specific situation. I have a program that utilizes 8 threads on new systems, heavy utilization of paralellism. Say I have a Vector containing 300 items. I split the indexes of those items over 8 threads, meaning the Vector will be accessed from 8 threads simultaniously, but every thread accesses a different item. The same memory location is never modified.

I have read something about Vector cache lines. What is the performance of the U++ implementation of the Vector in this situation? I tried to copy the thread-specific data into arrays and passed them into the functions, but it seems like just as fast.

If there is a better way to do this, I appreciate any suggestions. Smile
Re: Vector performance on a specific situation [message #40132 is a reply to message #40131] Tue, 18 June 2013 10:52 Go to previous messageGo to next message
mirek is currently offline  mirek
Messages: 13975
Registered: November 2005
Ultimate Member
crydev wrote on Tue, 18 June 2013 03:01

Hello,

I have a question about the Vector's performance in a specific situation. I have a program that utilizes 8 threads on new systems, heavy utilization of paralellism. Say I have a Vector containing 300 items. I split the indexes of those items over 8 threads, meaning the Vector will be accessed from 8 threads simultaniously, but every thread accesses a different item. The same memory location is never modified.

I have read something about Vector cache lines. What is the performance of the U++ implementation of the Vector in this situation?



It all dependes on sizeof(T) etc... but if you are doing a lot of access to elements and distribute threads in nearby indicies, cacheline sharing between threads is indeed a big problem.

Note that trivial Vector->Array does not really help here, as individual elements will be likely allocated in the same cachelines cells.

So it all depends on what you are doing with elements. 300 cells does not sound like too many, indicating that per-cell computation is pretty heavy (if it there is any advantage to use multiple threads).

For more qualified reply I would need to know definition of T and some description about computation.

Mirek
Re: Vector performance on a specific situation [message #40135 is a reply to message #40131] Tue, 18 June 2013 19:11 Go to previous messageGo to next message
Novo is currently offline  Novo
Messages: 1358
Registered: December 2006
Ultimate Contributor
crydev wrote on Tue, 18 June 2013 03:01

Hello,

I have a question about the Vector's performance in a specific situation. I have a program that utilizes 8 threads on new systems, heavy utilization of paralellism. Say I have a Vector containing 300 items. I split the indexes of those items over 8 threads, meaning the Vector will be accessed from 8 threads simultaniously, but every thread accesses a different item. The same memory location is never modified.

I have read something about Vector cache lines. What is the performance of the U++ implementation of the Vector in this situation? I tried to copy the thread-specific data into arrays and passed them into the functions, but it seems like just as fast.

If there is a better way to do this, I appreciate any suggestions. Smile

If you are just reading data there will be no problems. Smile But if you write to elements (even if they are not shared among threads) you get yourself into false sharing problem. Basically, the idea is that CPU doesn't work with words, it works with cache lines. The simplest way to fix that is to add padding to your data. Example: instead of using raw int you can use a structure below.
struct MyData {
    int data;
    char padding[64 - sizeof(data)];
};


Size of cache line is usually 64 bytes, so you need to add padding to make you data land onto different cache lines.


Regards,
Novo

[Updated on: Fri, 21 June 2013 02:06]

Report message to a moderator

Re: Vector performance on a specific situation [message #40137 is a reply to message #40131] Wed, 19 June 2013 09:03 Go to previous messageGo to next message
crydev is currently offline  crydev
Messages: 151
Registered: October 2012
Location: Netherlands
Experienced Member
Thank you for your answers.

sizeof(T) is 16-bytes.

struct
{
     unsigned int;
     int;
   
     struct
     {
          int;
          int;
     }
}


What my code does is reading a struct instance from the vector and editing the two integer fields in the sub-struct. What I did now, also stated in my first post, is copying the Vector.GetCount() / 8 count of structs from the vector into an array and performing operations there. Afterwards I copy them back into the vector at the original positions.

As I stated, it seems just as fast, the profiler also notes so. Can I conclude from that finding that this operation is faster to prevent cacheline sharing?
Re: Vector performance on a specific situation [message #40138 is a reply to message #40137] Wed, 19 June 2013 09:14 Go to previous messageGo to next message
mirek is currently offline  mirek
Messages: 13975
Registered: November 2005
Ultimate Member
crydev wrote on Wed, 19 June 2013 03:03

Thank you for your answers.

sizeof(T) is 16-bytes.

struct
{
     unsigned int;
     int;
   
     struct
     {
          int;
          int;
     }
}


What my code does is reading a struct instance from the vector and editing the two integer fields in the sub-struct.


This does not sound like awful lot of computation. I think that single thread will handle the task as fast or perhaps faster than multiple threads...

Now if you had 3 millions of elements instead of 300...

(Of course, I can still be mistaken about the amount of computation per element performed. Have you benchmarked that (I mean, single-threaded time to do one element)?
Re: Vector performance on a specific situation [message #40139 is a reply to message #40131] Wed, 19 June 2013 09:37 Go to previous message
crydev is currently offline  crydev
Messages: 151
Registered: October 2012
Location: Netherlands
Experienced Member
The computation on these elements is not very heavy, but the information in these structs is used to read over gigabytes of memory and compare every byte. If I use only one thread to do that it will be busy for a few minutes, where 8 threads will handle it in a few seconds. Smile

The amount of elements differs per process running on a windows machine. A small process has around 300 pages, which makes the vector contain 300 elements, but bigger processes can contain over 2000 pages, which increases workload a lot.

I have not yet benchmarked it for one thread, because I think it doesn't matter. If you use only one thread you simultaniously read from 0 to the end, where this problem is not really applicable. When 8 threads operate on the Vector, the first one operates on 0-49, the next on 50-99, and so on.
Previous Topic: Error in XML Read
Next Topic: Troubles compiling code with -fPIC
Goto Forum:
  


Current Time: Fri Mar 29 09:46:33 CET 2024

Total time taken to generate the page: 0.01509 seconds