Overview
Examples
Screenshots
Comparisons
Applications
Download
Documentation
Tutorials
Bazaar
Status & Roadmap
FAQ
Authors & License
Forums
Funding Ultimate++
Search on this site
Search in forums












SourceForge.net Logo
Home » Developing U++ » U++ Developers corner » Large data ahead
Large data ahead [message #51119] Tue, 29 January 2019 15:56 Go to next message
Tom1
Messages: 1212
Registered: March 2007
Senior Contributor
Hi,

It seems there is 'Large data ahead' as I have already faced data sets with over 1e9 items. While the containers (Vector, Array, etc.) have an "int GetCount()", there will soon be a question about handling those large data sets as int will wrap. Obviously, these large data sets run only on 64-bit platforms and with 32+ GB of memory, but this seems to be reality already.

Is there a plan to make high item count variants of U++ containers? Or just change the current containers to be high item count compatible using int64 instead of int?

Best regards,

Tom
Re: Large data ahead [message #51120 is a reply to message #51119] Tue, 29 January 2019 20:47 Go to previous messageGo to next message
mirek is currently offline  mirek
Messages: 13975
Registered: November 2005
Ultimate Member
Well, I acknowledge that this is an issue that needs to be considered....

The original reason for 'int' is memory consumption. In the moment that it would be changed to size_t means that all offsets everywhere are now twice as long...

For that reason I believe having "HugeVector" etc... is not that bad idea... But I might be wrong.

Mirek
Re: Large data ahead [message #51123 is a reply to message #51120] Wed, 30 January 2019 10:22 Go to previous messageGo to next message
Tom1
Messages: 1212
Registered: March 2007
Senior Contributor
Hi,

How about making the containers internally int64 and then offering both int and int64 interfaces in parallel? I'm sure there would be some tight spots (with e.g. serialization to maintain binary compatibility with int indexed serialized content in a way that they maintain the old binary serialization format until the element count goes beyond reach of int.) In most cases we know for sure already that the item count will absolutely remain below 2G, so there is no point in using int64 indexing in those cases, and therefore int offsets can be retained throughout the rest of such code.

Anyway, I have no hurry for this at the moment, so let's just give it some time...

Best regards,

Tom

Re: Large data ahead [message #51124 is a reply to message #51123] Wed, 30 January 2019 10:43 Go to previous messageGo to next message
mirek is currently offline  mirek
Messages: 13975
Registered: November 2005
Ultimate Member
Tom1 wrote on Wed, 30 January 2019 10:22
Hi,
How about making the containers internally int64 and then offering both int and int64 interfaces in parallel?


The reason I am hesitant about this is following:

struct Item {
Vector<String> foo;
Vector<int> bar;
};

Now with 'int' size, this is 32 bytes. With 'int64', this grows by 16 bytes (50%) which would be wasted to store zeroes in 99.999% of cases...

Also, Vector having exactly 16 bytes has (very) subtle advantage, it is a 'nice' number (memory is allocated in multiplies of 16, adressing can be done with simple shift etc...

Mirek

[Updated on: Wed, 30 January 2019 10:43]

Report message to a moderator

Re: Large data ahead [message #51125 is a reply to message #51124] Wed, 30 January 2019 11:50 Go to previous messageGo to next message
Tom1
Messages: 1212
Registered: March 2007
Senior Contributor
OK, I see. I agree it is a stupid thing to waste RAM for storing a whole lot of zeros. Efficiency and speed must be considered in every step along the way and this is one of those steps.

As you pointed out, this is really a "99.999% of 32 bit and 0.001 % of 64 bit" -type of situation, so maybe I should first consider using Buffer<T> instead of Vector<T>. This may be beneficial from multiple points of view after all...

Thanks for your insight. Smile

BR, Tom
Re: Large data ahead [message #51126 is a reply to message #51125] Wed, 30 January 2019 12:09 Go to previous messageGo to next message
mirek is currently offline  mirek
Messages: 13975
Registered: November 2005
Ultimate Member
Tom1 wrote on Wed, 30 January 2019 11:50
OK, I see. I agree it is a stupid thing to waste RAM for storing a whole lot of zeros. Efficiency and speed must be considered in every step along the way and this is one of those steps.

As you pointed out, this is really a "99.999% of 32 bit and 0.001 % of 64 bit" -type of situation, so maybe I should first consider using Buffer<T> instead of Vector<T>. This may be beneficial from multiple points of view after all...

Thanks for your insight. Smile

BR, Tom


Well, all that said, I am 100% OK with HugeVector (and then most of other containers) with int64 GetCount()...
Re: Large data ahead [message #51127 is a reply to message #51126] Wed, 30 January 2019 15:07 Go to previous message
Tom1
Messages: 1212
Registered: March 2007
Senior Contributor
mirek wrote on Wed, 30 January 2019 13:09
Tom1 wrote on Wed, 30 January 2019 11:50
OK, I see. I agree it is a stupid thing to waste RAM for storing a whole lot of zeros. Efficiency and speed must be considered in every step along the way and this is one of those steps.

As you pointed out, this is really a "99.999% of 32 bit and 0.001 % of 64 bit" -type of situation, so maybe I should first consider using Buffer<T> instead of Vector<T>. This may be beneficial from multiple points of view after all...

Thanks for your insight. Smile

BR, Tom


Well, all that said, I am 100% OK with HugeVector (and then most of other containers) with int64 GetCount()...


OK, Thanks Mirek. I will keep this in mind. The other points of view above include the fact that when loading huge data sets to RAM (near physical memory limits), re-allocating the buffer when Vector grows is a problem. Therefore, an accurately pre-calculated memory block size and a single allocation of that block really pays off -- both in processing time and in memory efficiency. Then Buffer<T> is just about the right choice.

Thanks and best regards,

Tom
Previous Topic: get debug info after intermittent crashing of theide
Next Topic: umk on Linux
Goto Forum:
  


Current Time: Thu Apr 18 03:05:31 CEST 2024

Total time taken to generate the page: 0.02630 seconds