Re: Map implementation [message #51552 is a reply to message #51549]
Sun, 14 April 2019 18:55
Novo
mirek wrote on Sun, 14 April 2019 02:03: If the performance is the issue, then the memory is the issue too. The game starts at L1 cache size, which can correspond to hundreds of records.
How to deal with the memory hierarchy is more or less clear (Cache-Conscious Data Structures, Cache-oblivious algorithms).
What I'm trying to say is that by using a little bit more memory you can significantly improve performance. For example, you can create a bitset of unoccupied slots. That would be overkill for a small hash table.
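For illustration, a minimal sketch of such a free-slot bitset (the type and names are made up, not anything from U++); one bit per slot, so even a million slots cost only about 122 KB:

#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical free-slot bitmap: bit i is set while slot i is unoccupied.
struct FreeSlotBitmap {
    std::vector<std::uint64_t> words;
    explicit FreeSlotBitmap(std::size_t slots) : words((slots + 63) / 64, ~0ull) {}
    void MarkOccupied(std::size_t i) { words[i >> 6] &= ~(1ull << (i & 63)); }
    void MarkFree(std::size_t i)     { words[i >> 6] |=  1ull << (i & 63); }
    bool IsFree(std::size_t i) const { return (words[i >> 6] >> (i & 63)) & 1; }
};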
I made a simple test.
#include <Core/Core.h>

using namespace Upp;

CONSOLE_APP_MAIN
{
    const int curMU = MemoryUsedKb();   // baseline memory, as in the full test program
    const int v_num = 1000000;
    Vector<Index<int> > v;
    Cout() << "sizeof(Index<int>): " << sizeof(Index<int>) << " bytes" << EOL;
    Cout() << "Mem used: " << MemoryUsedKb() - curMU << " Kb" << EOL;
    v.SetCount(v_num);
    Cout() << "Created " << v_num << " empty Index<int>" << EOL;
    Cout() << "Mem used: " << MemoryUsedKb() - curMU << " Kb" << EOL;
    const int isize = 100;
    for (int i = 0; i < isize; ++i) {
        const int jsize = v_num;
        for (int j = 0; j < jsize; ++j)
            v[j].Add(i);
        Cout() << "Added " << i + 1 << " elements" << EOL;
        Cout() << "Mem used: " << MemoryUsedKb() - curMU << " Kb" << EOL;
    }
}
Result:
sizeof(Index<int>): 80 bytes
Mem used: 0 Kb
Created 1000000 empty Index<int>
Mem used: 78128 Kb
Added 1 elements
Mem used: 237028 Kb
Added 2 elements
Mem used: 237028 Kb
Added 3 elements
Mem used: 237028 Kb
Added 4 elements
Mem used: 237028 Kb
Added 5 elements
Mem used: 237028 Kb
Added 6 elements
Mem used: 237028 Kb
Added 7 elements
Mem used: 237028 Kb
Added 8 elements
Mem used: 237028 Kb
Added 9 elements
Mem used: 397796 Kb
Added 10 elements
Mem used: 397796 Kb
...
Added 99 elements
Mem used: 2592740 Kb
Added 100 elements
Mem used: 2592740 Kb
IMHO, it is possible to do much better than this ...
Regards,
Novo
Re: Map implementation [message #51553 is a reply to message #51552]
Sun, 14 April 2019 19:52
mirek
Novo wrote on Sun, 14 April 2019 18:55: [...]
IMHO, it is possible to do much better than this ...
Well, keep in mind that the lower bound here is 400 MB - that is what it costs to store the keys themselves. If these keys were String, which in reality is much more typical, it would be 1600 MB just for the values (if they are small).
Current Index overhead is on average ~20 bytes per element, which roughly matches what we see here. It is hard to do much better. E.g. a typical 'collisions are a linked list' implementation will use about the same number of bytes. Even open addressing will need at least 8 bytes per node if you want to have any meaningful 'payload'.
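To make the ~20 bytes concrete, here is a back-of-the-envelope layout of a 'collisions are a linked list' node - an illustration, not the actual Index internals:

struct Bucket { int head; };                        // hash-map array entry (chain head)
struct Node   { unsigned hash; int prev, next; };   // per-element chain bookkeeping

// Assuming roughly one bucket per element and 32-bit links:
//   sizeof(Node) + sizeof(Bucket) = 12 + 4 = 16 bytes of overhead per element
// on top of the 4-byte key itself; growth slack in the dynamic arrays pushes
// the measured average toward the ~20 bytes quoted above.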
Anyway, thanks for the test. You are right that you can reduce that for very specific scenarios. And the next Index version will probably do about 20% better.
BTW, a test putting 100 * 1000000 elements into a single Index will produce the same results.
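Roughly, that single-Index variant would be (reusing curMU and the helpers from the quoted test; I did not post its output here):

Index<int> single;
for(int i = 0; i < 100 * 1000000; i++)
    single.Add(i);                      // 100M keys in one Index
Cout() << "Mem used: " << MemoryUsedKb() - curMU << " Kb" << EOL;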
Mirek
P.S.: Overall I am happy that you both started digging here, as I am thinking about refactoring Index, deprecating and detuning some unused features (ever used FindPrev?) in favor of those used most frequently.
Re: Map implementation [message #51558 is a reply to message #51552]
Tue, 16 April 2019 12:38
mirek
Novo wrote on Sun, 14 April 2019 18:55
Added 100 elements
Mem used: 2592740 Kb
IMHO, it is possible to do much better than this ...
Just for reference (Visual C++ 64 bit release):
#include <Core/Core.h>

#include <set>
#include <vector>

using namespace Upp;

CONSOLE_APP_MAIN
{
    int curMU = MemoryUsedKb();
    int v_num = 1000000;
    std::vector< std::set<int> > v;
    Cout() << "sizeof(std::set<int>): " << sizeof(std::set<int>) << " bytes" << EOL;
    Cout() << "Mem used: " << MemoryUsedKb() - curMU << " Kb" << EOL;
    v.resize(v_num);
    Cout() << "Created " << v_num << " empty std::set<int>" << EOL;
    Cout() << "Mem used: " << MemoryUsedKb() - curMU << " Kb" << EOL;
    const int isize = 100;
    for (int i = 0; i < isize; ++i) {
        const int jsize = v_num;
        for (int j = 0; j < jsize; ++j)
            v[j].insert(i);
        Cout() << "Added " << i + 1 << " elements" << EOL;
        Cout() << "Mem used: " << MemoryUsedKb() - curMU << " Kb" << EOL;
    }
}
Added 100 elements
Mem used: 3222476 Kb
Replace set with unordered_set and:
Added 100 elements
Mem used: 12412392 Kb
sizeof(std::unordered_set<int>) = 64
GCC 64-bit, unordered_set:
Added 100 elements
Mem used: 4621148 Kb
sizeof(std::unordered_set<int>) = 56
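The only change against the code above is the container type:

#include <unordered_set>

std::vector< std::unordered_set<int> > v;   // was: std::vector< std::set<int> >
// the insert loop (v[j].insert(i)) stays the same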
BTW, I am now banging my head over reducing the overhead from 20 bytes per element to 16. It looks like it might not be possible without performance degradation (in all scenarios). So I would say it is really hard to get below 20 bytes per element.
Mirek
[Updated on: Tue, 16 April 2019 12:40]
Re: Map implementation [message #51559 is a reply to message #51558]
Tue, 16 April 2019 16:38
Novo
BTW, in the project that I attached there are two configurations: a default one, which uses the U++ allocator, and a second one, which uses the standard allocator.
You should get different numbers with the two configurations.
Regards,
Novo
Re: Map implementation [message #51569 is a reply to message #51565]
Wed, 17 April 2019 16:13
Novo
mirek wrote on Wed, 17 April 2019 02:54
Well, nothing surprising there, right?
Yes, but ...
$ ./test_ht_perf
Mem used: 78128 Kb
Index<int> Add: 7.035
Index<int> Unlink: 12.272
Mem used: 2592740 Kb
Mem used after Sweep: 3800292 Kb
Mem used: 3769040 Kb
std::set<int> insert: 7.381
std::set<int> erase: 4.296
Mem used: 7603920 Kb
#include <Core/Core.h>

#include <set>

using namespace Upp;

CONSOLE_APP_MAIN
{
    // curMU and v_num as in the attached project; values match the output above
    const int curMU = MemoryUsedKb();
    const int v_num = 1000000;
    if (true) {
        Vector<Index<int> > v;
        v.SetCount(v_num);
        const int isize = 100;
        Cout() << "Mem used: " << MemoryUsedKb() - curMU << " Kb" << EOL;
        TimeStop ts;
        for (int i = 0; i < isize; ++i) {
            const int jsize = v_num;
            for (int j = 0; j < jsize; ++j)
                v[j].Add(i);
        }
        Cout() << "Index<int> Add: " << ts.ToString() << EOL;
        ts.Reset();
        for (int i = 0; i < isize; ++i) {
            const int jsize = v_num;
            for (int j = 0; j < jsize; ++j)
                v[j].UnlinkKey(i);
        }
        Cout() << "Index<int> Unlink: " << ts.ToString() << EOL;
        Cout() << "Mem used: " << MemoryUsedKb() - curMU << " Kb" << EOL;
        const int jsize = v_num;
        for (int j = 0; j < jsize; ++j)
            v[j].Sweep();
        Cout() << "Mem used after Sweep: " << MemoryUsedKb() - curMU << " Kb" << EOL;
    }
    if (true) {
        std::set<int>* v;
        v = new std::set<int>[v_num];
        const int isize = 100;
        Cout() << "Mem used: " << MemoryUsedKb() - curMU << " Kb" << EOL;
        TimeStop ts;
        for (int i = 0; i < isize; ++i) {
            const int jsize = v_num;
            for (int j = 0; j < jsize; ++j)
                v[j].insert(i);
        }
        Cout() << "std::set<int> insert: " << ts.ToString() << EOL;
        ts.Reset();
        for (int i = 0; i < isize; ++i) {
            const int jsize = v_num;
            for (int j = 0; j < jsize; ++j)
                v[j].erase(i);
        }
        Cout() << "std::set<int> erase: " << ts.ToString() << EOL;
        Cout() << "Mem used: " << MemoryUsedKb() - curMU << " Kb" << EOL;
    }
}
Project is attached.
std::set<int> erase is almost three times faster than Index<int> Unlink.
After calling Index::Sweep, even more memory is used. I guess this is a problem with the allocator.
And Index invalidates pointers.
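A minimal illustration of the pointer-stability point with std::set (node-based containers never move elements, while an array-backed index reallocates its key storage as it grows, so pointers taken earlier can dangle):

#include <cassert>
#include <set>

int main()
{
    std::set<int> s;
    const int *p = &*s.insert(42).first;   // pointer into a node of the set
    for (int i = 0; i < 1000; ++i)
        s.insert(i);                       // later inserts never move existing nodes
    assert(*p == 42);                      // p is still valid
    // An array-backed container would reallocate its key storage on growth,
    // so a pointer taken before the growth could dangle.
}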
So ...
Regards,
Novo
Re: Map implementation [message #51570 is a reply to message #51569]
Wed, 17 April 2019 18:39
mirek
Novo wrote on Wed, 17 April 2019 16:13: [...]
std::set<int> erase is three times faster than Index<int> Unlink. After calling Index::Sweep even more memory is used. I guess this is a problem with the allocator. And Index invalidates pointers. So ...
Cool, interesting catch.
It took me a while to dig into the memory issue, and it is really interesting - it looks like the problem is in this line:
void HashBase::FinishIndex()
{
    int q = link.GetCount();
    link.Reserve(hash.GetAlloc()); // <==== HERE
    link.AddN(hash.GetCount() - q);
    for(int i = q; i < hash.GetCount(); i++)
        LinkTo(i, link[i], hash[i] & UNSIGNED_HIBIT ? unlinked : Mapi(i));
}
If you comment it out, all works fine. The reason is that at that point, the 'over-reservation' of link in this particular case leads to going from small allocator blocks to large ones. The small blocks are then left free for further use, which in this example never materializes.
I will definitely keep this scenario for testing with the new Index... That said, this is really not likely to happen in a real app.
Going to look into the Unlink issue now.
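For reference, what I tested is literally commenting that line out; whether that is the right general fix is another question:

void HashBase::FinishIndex()
{
    int q = link.GetCount();
    // link.Reserve(hash.GetAlloc());   // commented out: avoids jumping from small to large allocator blocks here
    link.AddN(hash.GetCount() - q);
    for(int i = q; i < hash.GetCount(); i++)
        LinkTo(i, link[i], hash[i] & UNSIGNED_HIBIT ? unlinked : Mapi(i));
}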
Re: Map implementation [message #51955 is a reply to message #51813]
Tue, 25 June 2019 15:20
Novo
mirek wrote on Fri, 07 June 2019 07:58: Index and allocator are now refactored, should behave better in synthetic benchmarks such as this one.
It performs better, but Unlink is still ~2.5 times slower than std::set::erase.
Mem used: 39064 Kb
Index<int> Add: 5.691
Index<int> Unlink: 9.380
Mem used: 2959732 Kb
Index<int> Sweep: 0.204
Mem used after Sweep: 2959732 Kb
Index<int> Shrink: 0.122
Mem used after Shrink: 39096 Kb
Mem used: 46908 Kb
std::set<int> insert: 5.975
std::set<int> erase: 3.612
Mem used: 46904 Kb
Regards,
Novo
Re: Map implementation [message #51957 is a reply to message #51955]
Wed, 26 June 2019 12:07
mirek
Well, with a benchmark constructed like this, beating set<int>::erase turns out to be hard.
The interesting thing is that if you change the order of the loops:
#include <Core/Core.h>

#include <set>

using namespace Upp;

CONSOLE_APP_MAIN
{
#ifdef _DEBUG
    const int v_num = 10000;
#else
    const int v_num = 100000;
#endif
    const int isize = 100;
    {
        Vector<Index<int> > v;
        v.SetCount(v_num);
        {
            RTIMING("FindAdd v_num outer");
            for (int j = 0; j < v_num; ++j)
                for (int i = 0; i < isize; ++i)
                    v[j].FindAdd(i);
        }
        {
            RTIMING("UnlinkKey v_num outer");
            for (int j = 0; j < v_num; ++j)
                for (int i = 0; i < isize; ++i)
                    v[j].UnlinkKey(i);
        }
        RTIMING("Sweep v_num outer");
        const int jsize = v_num;
        for (int j = 0; j < jsize; ++j)
            v[j].Sweep();
    }
    {
        Vector<Index<int> > v;
        v.SetCount(v_num);
        {
            RTIMING("FindAdd v_num inner");
            for (int i = 0; i < isize; ++i)
                for (int j = 0; j < v_num; ++j)
                    v[j].FindAdd(i);
        }
        {
            RTIMING("UnlinkKey v_num inner");
            for (int i = 0; i < isize; ++i)
                for (int j = 0; j < v_num; ++j)
                    v[j].UnlinkKey(i);
        }
        RTIMING("Sweep v_num inner");
        const int jsize = v_num;
        for (int j = 0; j < jsize; ++j)
            v[j].Sweep();
    }
    {
        std::set<int> *v = new std::set<int>[v_num];
        {
            RTIMING("insert v_num outer");
            for (int j = 0; j < v_num; ++j)
                for (int i = 0; i < isize; ++i)
                    v[j].insert(i);
        }
        {
            RTIMING("erase v_num outer");
            for (int j = 0; j < v_num; ++j)
                for (int i = 0; i < isize; ++i)
                    v[j].erase(i);
        }
        delete[] v;
    }
    {
        std::set<int> *v = new std::set<int>[v_num];
        {
            RTIMING("insert v_num inner");
            for (int i = 0; i < isize; ++i)
                for (int j = 0; j < v_num; ++j)
                    v[j].insert(i);
        }
        {
            RTIMING("erase v_num inner");
            for (int i = 0; i < isize; ++i)
                for (int j = 0; j < v_num; ++j)
                    v[j].erase(i);
        }
        delete[] v;
    }
}
results are very different:
* C:\upp\out\benchmarks\MINGWx64\Index.exe 26.06.2019 11:36:42, user: cxl
TIMING erase v_num inner: 480.00 ms - 480.00 ms (480.00 ms / 1 ), min: 480.00 ms, max: 480.00 ms, nesting: 0 - 1
TIMING insert v_num inner: 702.00 ms - 702.00 ms (702.00 ms / 1 ), min: 702.00 ms, max: 702.00 ms, nesting: 0 - 1
TIMING erase v_num outer: 427.00 ms - 427.00 ms (427.00 ms / 1 ), min: 427.00 ms, max: 427.00 ms, nesting: 0 - 1
TIMING insert v_num outer: 399.00 ms - 399.00 ms (399.00 ms / 1 ), min: 399.00 ms, max: 399.00 ms, nesting: 0 - 1
TIMING Sweep v_num inner: 22.00 ms - 22.00 ms (22.00 ms / 1 ), min: 22.00 ms, max: 22.00 ms, nesting: 0 - 1
TIMING UnlinkKey v_num inner: 995.00 ms - 995.00 ms (995.00 ms / 1 ), min: 995.00 ms, max: 995.00 ms, nesting: 0 - 1
TIMING FindAdd v_num inner: 683.00 ms - 683.00 ms (683.00 ms / 1 ), min: 683.00 ms, max: 683.00 ms, nesting: 0 - 1
TIMING Sweep v_num outer: 28.00 ms - 28.00 ms (28.00 ms / 1 ), min: 28.00 ms, max: 28.00 ms, nesting: 0 - 1
TIMING UnlinkKey v_num outer: 118.00 ms - 118.00 ms (118.00 ms / 1 ), min: 118.00 ms, max: 118.00 ms, nesting: 0 - 1
TIMING FindAdd v_num outer: 242.00 ms - 242.00 ms (242.00 ms / 1 ), min: 242.00 ms, max: 242.00 ms, nesting: 0 - 1
which probably means that set<int> is simply more cache-friendly under the original benchmark's loop order...
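Rough working-set arithmetic behind that guess, using sizeof(Index<int>) = 80 from earlier and the release v_num = 100000:

// 'v_num inner' order (i outer, j inner): each pass over i touches all
// 100000 containers once, so the working set is at least
//     100000 * sizeof(Index<int>) = 100000 * 80 B ~ 8 MB
// of Index headers alone, plus every container's heap blocks - far larger
// than L2, so nearly every access is a cache miss.
// 'v_num outer' order (j outer, i inner): one container's ~100 ints plus its
// hash arrays at a time - a few KB that stay hot in L1 for the whole inner loop.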
Re: Map implementation [message #51999 is a reply to message #51994]
Mon, 01 July 2019 17:53
mirek
Things are quite different if, instead of an incremental pattern, you feed in random data:
#include <Core/Core.h>

#include <set>

using namespace Upp;

CONSOLE_APP_MAIN
{
#ifdef _DEBUG
    const int v_num = 10000;
#else
    const int v_num = 1000;
#endif
    const int isize = 100;
    const int N = 100;

    Vector<int> data;
    for(int i = 0; i < isize * v_num; i++)
        data.Add(Random());

    for(int ii = 0; ii < N; ii++) {
        {
            Vector<Index<int> > v;
            v.SetCount(v_num);
            {
                RTIMING("inner FindAdd v_num");
                int *s = data;
                for (int i = 0; i < isize; ++i)
                    for (int j = 0; j < v_num; ++j)
                        v[j].FindAdd(*s++);
            }
            {
                RTIMING("inner UnlinkKey v_num");
                int *s = data;
                for (int i = 0; i < isize; ++i)
                    for (int j = 0; j < v_num; ++j)
                        v[j].UnlinkKey(*s++);
            }
            RTIMING("inner Sweep v_num");
            const int jsize = v_num;
            for (int j = 0; j < jsize; ++j)
                v[j].Sweep();
        }
        {
            Vector<Index<int> > v;
            v.SetCount(v_num);
            {
                RTIMING("outer FindAdd v_num");
                int *s = data;
                for (int j = 0; j < v_num; ++j)
                    for (int i = 0; i < isize; ++i)
                        v[j].FindAdd(*s++);
            }
            {
                RTIMING("outer UnlinkKey v_num");
                int *s = data;
                for (int j = 0; j < v_num; ++j)
                    for (int i = 0; i < isize; ++i)
                        v[j].UnlinkKey(*s++);
            }
            RTIMING("outer Sweep v_num");
            const int jsize = v_num;
            for (int j = 0; j < jsize; ++j)
                v[j].Sweep();
        }
        {
            std::set<int> *v = new std::set<int>[v_num];
            {
                RTIMING("outer insert v_num");
                int *s = data;
                for (int j = 0; j < v_num; ++j)
                    for (int i = 0; i < isize; ++i)
                        v[j].insert(*s++);
            }
            {
                RTIMING("outer erase v_num");
                int *s = data;
                for (int j = 0; j < v_num; ++j)
                    for (int i = 0; i < isize; ++i)
                        v[j].erase(*s++);
            }
            delete[] v;
        }
        {
            std::set<int> *v = new std::set<int>[v_num];
            {
                RTIMING("inner insert v_num");
                int *s = data;
                for (int i = 0; i < isize; ++i)
                    for (int j = 0; j < v_num; ++j)
                        v[j].insert(*s++);
            }
            {
                RTIMING("inner erase v_num");
                int *s = data;
                for (int i = 0; i < isize; ++i)
                    for (int j = 0; j < v_num; ++j)
                        v[j].erase(*s++);
            }
            delete[] v;
        }
    }
}
I guess incremental data somehow favors set; my guess is that it works as an accidental prefetch here...
[Updated on: Mon, 01 July 2019 17:53]