Large data ahead [message #51119] |
Tue, 29 January 2019 15:56  |
Tom1
Messages: 1301 Registered: March 2007
Ultimate Contributor |
Hi,
It seems there is 'Large data ahead', as I have already faced data sets with over 1e9 items. While the containers (Vector, Array, etc.) have an "int GetCount()", the question of handling such large data sets will soon come up, as a 32-bit int cannot represent counts above 2^31-1 (about 2.1e9). Obviously, these large data sets only fit on 64-bit platforms with 32+ GB of memory, but this seems to be reality already.
Is there a plan to make high item count variants of U++ containers? Or just change the current containers to be high item count compatible using int64 instead of int?
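To illustrate the limit in question, here is a minimal sketch in plain standard C++ (not U++ code). The helpers FitsInInt and CheckedCount are hypothetical names introduced only for this example, and the sketch assumes a 32-bit int, which is what common 64-bit platforms use.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper (not part of U++): true if a 64-bit item count
// can be reported through an 'int' interface without loss.
bool FitsInInt(std::int64_t count)
{
    return count >= 0 && count <= std::numeric_limits<int>::max();
}

// Hypothetical helper: narrow a 64-bit count to int, refusing to
// truncate silently when the count is out of range.
int CheckedCount(std::int64_t count)
{
    assert(FitsInInt(count));
    return static_cast<int>(count);
}
```

A data set of 1e9 items still fits, while one of 3e9 items does not, so the problem starts a little above 2.1e9 items.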
Best regards,
Tom
Re: Large data ahead [message #51123 is a reply to message #51120] |
Wed, 30 January 2019 10:22   |
Tom1
Hi,
How about making the containers internally int64 and then offering both int and int64 interfaces in parallel? I'm sure there would be some tight spots, e.g. serialization: to stay binary compatible with existing int-indexed serialized content, it would have to keep the old binary format until the element count goes beyond the reach of int. In most cases we already know for sure that the item count will remain below 2G, so there is no point in using int64 indexing there, and int offsets can be retained throughout the rest of such code.
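A rough sketch of what such a parallel interface might look like, in plain standard C++ with std::vector standing in for the real storage. The names HugeVectorSketch, GetCount64 and CountFieldBytes are assumptions made up for this illustration and are not U++ API; how an actual stream would mark the extended format is left open.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical sketch, not U++ code: counts in 64 bits internally and
// exposes both an int and an int64 view of the count.
template <class T>
class HugeVectorSketch {
    std::vector<T> data; // stand-in for the real storage

public:
    void Add(const T& x) { data.push_back(x); }

    // 64-bit interface: always valid.
    std::int64_t GetCount64() const { return static_cast<std::int64_t>(data.size()); }

    // int interface for code that knows it stays below 2G items.
    int GetCount() const
    {
        std::int64_t n = GetCount64();
        assert(n <= std::numeric_limits<int>::max());
        return static_cast<int>(n);
    }

    // An int index converts implicitly, so one overload serves both.
    T& operator[](std::int64_t i) { return data[static_cast<std::size_t>(i)]; }
};

// Serialization idea from the post: keep the old 32-bit count field
// while the count fits in int, and widen it only beyond that.
// Returns the width of the count field in bytes (assumed 4 or 8).
inline int CountFieldBytes(std::int64_t count)
{
    return count <= std::numeric_limits<int>::max() ? 4 : 8;
}
```

Code that is known to stay below 2G items keeps calling GetCount() with int offsets, and only the huge-data paths need GetCount64().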
Anyway, I have no hurry for this at the moment, so let's just give it some time...
Best regards,
Tom
Re: Large data ahead [message #51124 is a reply to message #51123] |
Wed, 30 January 2019 10:43   |
mirek
Messages: 14255 Registered: November 2005
Ultimate Member |
Tom1 wrote on Wed, 30 January 2019 10:22:
    Hi,
    How about making the containers internally int64 and then offering both int and int64 interfaces in parallel?
The reason I am hesitant about this is the following:
struct Item {
    Vector<String> foo;
    Vector<int> bar;
};
Now with 'int' size, this is 32 bytes (two 16-byte Vectors). With 'int64', each Vector becomes 24 bytes, so Item grows by 16 bytes (50%), which would be wasted to store zeroes in 99.999% of cases...
Also, Vector having exactly 16 bytes has a (very) subtle advantage: it is a 'nice' number (memory is allocated in multiples of 16, addressing can be done with a simple shift, etc.)...
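The arithmetic can be checked with a small sketch in plain standard C++. The structs below are stand-ins assumed for illustration (one pointer plus two counters); they are not the actual U++ definitions, and the sizes hold only on targets with 8-byte pointers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-ins for the two layouts under discussion; NOT the U++ definitions.
struct VecInt32 { void *ptr; std::int32_t items; std::int32_t alloc; };
struct VecInt64 { void *ptr; std::int64_t items; std::int64_t alloc; };

struct ItemInt32 { VecInt32 foo; VecInt32 bar; };
struct ItemInt64 { VecInt64 foo; VecInt64 bar; };

// Extra bytes every Item would carry with 64-bit counters.
constexpr std::size_t ExtraBytesPerItem()
{
    return sizeof(ItemInt64) - sizeof(ItemInt32);
}

// Checked only where pointers are 8 bytes, so the sketch still compiles elsewhere.
static_assert(sizeof(void *) != 8 || sizeof(VecInt32) == 16, "16-byte vector header");
static_assert(sizeof(void *) != 8 || sizeof(VecInt64) == 24, "24-byte vector header");
static_assert(sizeof(void *) != 8 || ExtraBytesPerItem() == 16, "16 extra bytes per Item");
```

With a billion such Items, those 16 extra bytes per Item would add up to roughly 16 GB spent mostly on zeroes.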
Mirek
[Updated on: Wed, 30 January 2019 10:43]
Re: Large data ahead [message #51127 is a reply to message #51126] |
Wed, 30 January 2019 15:07  |
Tom1
mirek wrote on Wed, 30 January 2019 13:09:
    Tom1 wrote on Wed, 30 January 2019 11:50:
        OK, I see. I agree it is a stupid thing to waste RAM for storing a whole lot of zeros. Efficiency and speed must be considered in every step along the way and this is one of those steps.
        As you pointed out, this is really a "99.999% of 32 bit and 0.001% of 64 bit" -type of situation, so maybe I should first consider using Buffer<T> instead of Vector<T>. This may be beneficial from multiple points of view after all...
        Thanks for your insight.
        BR, Tom
    Well, all that said, I am 100% OK with HugeVector (and then most of other containers) with int64 GetCount()...
OK, thanks Mirek. I will keep this in mind. The other points of view I mentioned above include the fact that when loading huge data sets into RAM (near physical memory limits), re-allocating the buffer as a Vector grows is a problem. Therefore, an accurately pre-calculated memory block size and a single allocation of that block really pay off -- both in processing time and in memory efficiency. In that case Buffer<T> is just about the right choice.
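The effect can be illustrated with a minimal sketch in plain standard C++, using std::vector as a stand-in for a growing container; this is not U++ code and CountReallocations is a hypothetical helper name. Each reallocation means the old and the new buffer exist at the same time while the items are moved, which is exactly what hurts near the physical memory limit.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts how many times the buffer is reallocated while n items are
// appended. If 'preallocate' is true, the exact size is reserved first.
std::size_t CountReallocations(std::size_t n, bool preallocate)
{
    std::vector<int> v;
    if (preallocate)
        v.reserve(n); // single allocation of the pre-calculated size
    std::size_t reallocations = 0;
    std::size_t cap = v.capacity();
    for (std::size_t i = 0; i < n; i++) {
        v.push_back(static_cast<int>(i));
        if (v.capacity() != cap) { // capacity changed: buffer was reallocated
            reallocations++;
            cap = v.capacity();
        }
    }
    return reallocations;
}
```

With the size reserved up front the count is zero; without it the buffer is reallocated repeatedly as it grows, with the exact number depending on the library's growth policy.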
Thanks and best regards,
Tom