U++ forum: Welcome to the forum

Search on this site

Search in forums

Home » U++ Library support » U++ Core » Vector performance on a specific situation

Show: Today's Messages :: Show Polls :: Message Navigator
E-mail to friend

Vector performance on a specific situation [message #40131]

Tue, 18 June 2013 09:01

crydev
Messages: 151
Registered: October 2012
Location: Netherlands

Experienced Member

Hello,

I have a question about the Vector's performance in a specific situation. I have a program that utilizes 8 threads on new systems, heavy utilization of paralellism. Say I have a Vector containing 300 items. I split the indexes of those items over 8 threads, meaning the Vector will be accessed from 8 threads simultaniously, but every thread accesses a different item. The same memory location is never modified.

I have read something about Vector cache lines. What is the performance of the U++ implementation of the Vector in this situation? I tried to copy the thread-specific data into arrays and passed them into the functions, but it seems like just as fast.

If there is a better way to do this, I appreciate any suggestions. Smile

Report message to a moderator

Re: Vector performance on a specific situation [message #40132 is a reply to message #40131]

Tue, 18 June 2013 10:52

mirek
Messages: 13984
Registered: November 2005

Ultimate Member

crydev wrote on Tue, 18 June 2013 03:01

It all dependes on sizeof(T) etc... but if you are doing a lot of access to elements and distribute threads in nearby indicies, cacheline sharing between threads is indeed a big problem.

Note that trivial Vector->Array does not really help here, as individual elements will be likely allocated in the same cachelines cells.

So it all depends on what you are doing with elements. 300 cells does not sound like too many, indicating that per-cell computation is pretty heavy (if it there is any advantage to use multiple threads).

For more qualified reply I would need to know definition of T and some description about computation.

Mirek

Report message to a moderator

Re: Vector performance on a specific situation [message #40135 is a reply to message #40131]

Tue, 18 June 2013 19:11

Novo
Messages: 1361
Registered: December 2006

Ultimate Contributor

crydev wrote on Tue, 18 June 2013 03:01

If you are just reading data there will be no problems. Smile

But if you write to elements (even if they are not shared among threads) you get yourself into false sharing problem. Basically, the idea is that CPU doesn't work with words, it works with cache lines. The simplest way to fix that is to add padding to your data. Example: instead of using raw int you can use a structure below.

struct MyData {
    int data;
    char padding[64 - sizeof(data)];
};

Size of cache line is usually 64 bytes, so you need to add padding to make you data land onto different cache lines.

Regards,
Novo

[Updated on: Fri, 21 June 2013 02:06]

Report message to a moderator

Re: Vector performance on a specific situation [message #40137 is a reply to message #40131]

Wed, 19 June 2013 09:03

crydev
Messages: 151
Registered: October 2012
Location: Netherlands

Experienced Member

Thank you for your answers.

sizeof(T) is 16-bytes.

struct
{
     unsigned int;
     int;
   
     struct
     {
          int;
          int;
     }
}

What my code does is reading a struct instance from the vector and editing the two integer fields in the sub-struct. What I did now, also stated in my first post, is copying the Vector.GetCount() / 8 count of structs from the vector into an array and performing operations there. Afterwards I copy them back into the vector at the original positions.

As I stated, it seems just as fast, the profiler also notes so. Can I conclude from that finding that this operation is faster to prevent cacheline sharing?

Report message to a moderator

Re: Vector performance on a specific situation [message #40138 is a reply to message #40137]

Wed, 19 June 2013 09:14

mirek
Messages: 13984
Registered: November 2005

Ultimate Member

crydev wrote on Wed, 19 June 2013 03:03

Thank you for your answers.

sizeof(T) is 16-bytes.

struct
{
     unsigned int;
     int;
   
     struct
     {
          int;
          int;
     }
}

What my code does is reading a struct instance from the vector and editing the two integer fields in the sub-struct.

This does not sound like awful lot of computation. I think that single thread will handle the task as fast or perhaps faster than multiple threads...

Now if you had 3 millions of elements instead of 300...

(Of course, I can still be mistaken about the amount of computation per element performed. Have you benchmarked that (I mean, single-threaded time to do one element)?

Report message to a moderator

Re: Vector performance on a specific situation [message #40139 is a reply to message #40131]

Wed, 19 June 2013 09:37

crydev
Messages: 151
Registered: October 2012
Location: Netherlands

Experienced Member

The computation on these elements is not very heavy, but the information in these structs is used to read over gigabytes of memory and compare every byte. If I use only one thread to do that it will be busy for a few minutes, where 8 threads will handle it in a few seconds. Smile

The amount of elements differs per process running on a windows machine. A small process has around 300 pages, which makes the vector contain 300 elements, but bigger processes can contain over 2000 pages, which increases workload a lot.

I have not yet benchmarked it for one thread, because I think it doesn't matter. If you use only one thread you simultaniously read from 0 to the end, where this problem is not really applicable. When 8 threads operate on the Vector, the first one operates on 0-49, the next on 50-99, and so on.

Report message to a moderator

Previous Topic:	Error in XML Read
Next Topic:	Troubles compiling code with -fPIC

Goto Forum:

-=] Back to Top [=-

[ Syndicate this forum (XML) ] [

] [

]

Current Time: Wed Jun 05 06:52:23 CEST 2024

Total time taken to generate the page: 0.01311 seconds