Home » Community » U++ community news and announcements » Core 2019
Re: Core 2019 [message #51856 is a reply to message #51855]
Sun, 09 June 2019 23:11
mirek
Messages: 14038 Registered: November 2005
Ultimate Member
Novo wrote on Sun, 09 June 2019 21:39:
> mirek wrote on Sun, 09 June 2019 12:54:
> > Looking at the peak profile, it looks like there are very few "small" blocks and most of the memory is in those 80 "huge" (that means >64KB) blocks.
> > Can that be correct?
> > Mirek
> It is hard to tell. I'm not controlling that.
> Another problem is that all allocations/deallocations happen in CoWork's threads. I cannot call RDUMP(*PeakMemoryProfile()) inside of CoWork because it would be called at least 181363 times ...
Why would you want to? Peak is really peak: it is the profile at the moment of maximum memory use.
One caveat about the profile is that, for small and large blocks, it only covers the current thread. But our problem is with huge blocks anyway.
Quote:
The app is parsing a Wikipedia XML dump. It is decompressing a bz2 archive and parsing chunks of XML. After that my own parser is parsing the Mediawiki text.
As a first pass my parser builds a list of tokens organized as a Vector<> (I'm not inserting in the middle).
My parser avoids memory allocation at all costs. I'm calling Vector::SetCountR and reusing these vectors. When I need to deal with String I'm using my own non-owning string class.
Well, maybe there can also be interference with MemoryTryRealloc (as those Vectors grow). Perhaps you can test what happens if

    bool MemoryTryRealloc(void *ptr, size_t& newsize) {
        return false; // (((dword)(uintptr_t)ptr) & 16) && MemoryTryRealloc__(ptr, newsize);
    }
Quote:
Unfortunately, I cannot control memory allocation with XmlParser. I have to rely on the default allocator.
There are not many... BTW, are you parsing memory - XmlParser(const char *), or streams - XmlParser(Stream& in) ?
Mirek
Re: Core 2019 [message #51857 is a reply to message #51856]
Sun, 09 June 2019 23:25
mirek
Messages: 14038 Registered: November 2005
Ultimate Member
Here is the code for logging all huge allocations (replace in Core/hheap.cpp):
void *Heap::HugeAlloc(size_t count) // count in 4KB pages
{
	ASSERT(count);

#ifdef LSTAT
	if(count < 65536)
		hstat[count]++;
#endif

	huge_4KB_count += count;
	if(huge_4KB_count > huge_4KB_count_max) {
		huge_4KB_count_max = huge_4KB_count;
		if(MemoryUsedKb() > sKBLimit)
			Panic("MemoryLimitKb breached!");
		if(sPeak)
			Make(*sPeak);
	}

	if(!D::freelist[0]->next) { // initialization
		for(int i = 0; i < 2; i++)
			Dbl_Self(D::freelist[i]);
	}

	if(count > HPAGE) { // we are wasting 4KB to store just 4 bytes here, but this is >32MB after all..
		LTIMING("SysAlloc");
		byte *sysblk = (byte *)SysAllocRaw((count + 1) * 4096, 0);
		BlkHeader *h = (BlkHeader *)(sysblk + 4096);
		h->size = 0;
		*((size_t *)sysblk) = count;
		sys_count++;
		sys_size += 4096 * count;
		return h;
	}

	LTIMING("Huge Alloc");

	word wcount = (word)count;

	if(16 * free_4KB > huge_4KB_count) // keep number of free 4KB blocks in check
		FreeSmallEmpty(INT_MAX, int(free_4KB - huge_4KB_count / 32));

	for(int pass = 0; pass < 2; pass++) {
		for(int i = count >= 16; i < 2; i++) {
			BlkHeader *l = D::freelist[i];
			BlkHeader *h = l->next;
			while(h != l) {
				word sz = h->GetSize();
				if(sz >= count) {
					void *ptr = MakeAlloc(h, wcount);
					if(count > 16)
						RLOG("HugeAlloc " << asString(ptr) << ", size: " << asString(count));
					return ptr;
				}
				h = h->next;
			}
		}
		if(!FreeSmallEmpty(wcount, INT_MAX)) { // try to coalesce 4KB small free blocks back to huge storage
			void *ptr = SysAllocRaw(HPAGE * 4096, 0);
			HugePage *pg = (HugePage *)MemoryAllocPermanent(sizeof(HugePage));
			pg->page = ptr;
			pg->next = huge_pages;
			huge_pages = pg;
			AddChunk((BlkHeader *)ptr, HPAGE); // failed, add 32MB from the system
			huge_chunks++;
		}
	}

	Panic("Out of memory");
	return NULL;
}

int Heap::HugeFree(void *ptr)
{
	BlkHeader *h = (BlkHeader *)ptr;
	if(h->size == 0) {
		LTIMING("Sys Free");
		byte *sysblk = (byte *)h - 4096;
		size_t count = *((size_t *)sysblk);
		SysFreeRaw(sysblk, (count + 1) * 4096);
		huge_4KB_count -= count;
		sys_count--;
		sys_size -= 4096 * count;
		return 0;
	}
	LTIMING("Huge Free");
	if(h->GetSize() > 16)
		RLOG("HugeFree " << asString(ptr) << ", size: " << asString(h->GetSize()));
	huge_4KB_count -= h->GetSize();
	return BlkHeap::Free(h)->GetSize();
}

bool Heap::HugeTryRealloc(void *ptr, size_t count)
{
	bool b = count <= HPAGE && BlkHeap::TryRealloc(ptr, count, huge_4KB_count);
	if(b)
		RLOG("HugeRealloc " << asString(ptr) << ", size: " << asString(count));
	return b;
}
(please test with active MemoryTryRealloc)
[Updated on: Sun, 09 June 2019 23:25]
Re: Core 2019 [message #51862 is a reply to message #51860]
Mon, 10 June 2019 18:01
Novo
Messages: 1371 Registered: December 2006
Ultimate Contributor
mirek wrote on Mon, 10 June 2019 11:27:
> I have tried to improve fragmentation using approximate best fit, hopefully this will help a bit... (in trunk)
Thanks!
mem: 400 Mb, time: 230 s.
This is a huge improvement.
Regards,
Novo
Re: Core 2019 [message #51864 is a reply to message #51856]
Mon, 10 June 2019 18:18
Novo
Messages: 1371 Registered: December 2006
Ultimate Contributor
mirek wrote on Sun, 09 June 2019 17:11:
> BTW, are you parsing memory - XmlParser(const char *), or streams - XmlParser(Stream& in) ?
Stream. bz2::DecompressStream.
I guess that XmlParser is responsible for fragmentation.
Regards,
Novo
Re: Core 2019 [message #51865 is a reply to message #51863]
Mon, 10 June 2019 18:21
Novo
Messages: 1371 Registered: December 2006
Ultimate Contributor
mirek wrote on Mon, 10 June 2019 12:17:
> Cool. So I guess the issue is solved and we do not need to worry about other tests?
I'll try to run other tests and see what happens ...
Regards,
Novo
Re: Core 2019 [message #51866 is a reply to message #51856]
Mon, 10 June 2019 18:34
Novo
Messages: 1371 Registered: December 2006
Ultimate Contributor
mirek wrote on Sun, 09 June 2019 17:11:
> Well, maybe there can also be interference with MemoryTryRealloc (as those Vectors grow). Perhaps you can test what happens if
>
>     bool MemoryTryRealloc(void *ptr, size_t& newsize) {
>         return false; // (((dword)(uintptr_t)ptr) & 16) && MemoryTryRealloc__(ptr, newsize);
>     }
This doesn't affect anything.
Regards,
Novo
Re: Core 2019 [message #51867 is a reply to message #51857]
Mon, 10 June 2019 18:45
Novo
Messages: 1371 Registered: December 2006
Ultimate Contributor
mirek wrote on Sun, 09 June 2019 17:25:
> Here is the code for logging all huge allocations (replace in Core/hheap.cpp):
This code is crashing with the latest trunk.
I guess we can stop at this point.
Regards,
Novo
Re: Core 2019 [message #51923 is a reply to message #51812]
Fri, 21 June 2019 05:43
Novo
Messages: 1371 Registered: December 2006
Ultimate Contributor
Update.
I changed my app. Now it is doing a bunch of string manipulations (mostly concatenations).
Results:
U++:
time: 234s, mem is growing to 4.4Gb and it is not going down till the very end.
glibc:
Default settings (8-core CPU, CoWork pool has 18 threads):
time: 239s, mem max is 6.5Gb down to 3.6Gb
export MALLOC_ARENA_MAX=16
time: 239s, mem max is 4.2Gb down to 2.8Gb
export MALLOC_ARENA_MAX=8
time: 244s, mem max is 4.0Gb down to 1.3Gb
Conclusion:
The glibc allocator is more efficient with a little bit of manual tuning. The difference in performance is not that significant. I cannot tell anything about Windows.
Regards,
Novo
[Updated on: Sat, 22 June 2019 18:08]
Re: Core 2019 [message #51924 is a reply to message #51923]
Fri, 21 June 2019 06:14
Novo
Messages: 1371 Registered: December 2006
Ultimate Contributor
It looks like the performance of the glibc allocator depends on the state of the system. I got new interesting results:
export MALLOC_ARENA_MAX=20
time: 239s, mem max is 3.9Gb down to 0.7Gb
Regards,
Novo
Re: Core 2019 [message #51925 is a reply to message #51924]
Fri, 21 June 2019 09:16
mirek
Messages: 14038 Registered: November 2005
Ultimate Member
I think you cannot deduce too much about efficiency by looking at the "down" number. That will depend a lot on the overall load of the system: the higher the load, the lower this number, as the system will page out unused pages from your app's address space. So the current philosophy of the U++ allocator is that this does not matter.
Further explanation. The function that any allocator uses to obtain address space from the system is mmap, and munmap returns address space to the system. Normally there is a threshold: if a block is too big, its allocation is simply handled by mmap/munmap calls, meaning it is returned to the system at MemoryFree. If it is smaller than the threshold, a bigger chunk is mmaped from the system and then divided into smaller chunks (somehow).
Now what is different is that the standard GCC allocator has its threshold at 4MB, the U++ allocator at 224MB. In practice, this means that if you alloc/free a 5MB block in std, it gets released back to the system immediately. With U++, blocks up to 224MB are not returned to the system immediately. If they are really unused, this just means that the system will reclaim them when there is a need for more physical memory.
Mirek
Re: Core 2019 [message #51931 is a reply to message #51925]
Fri, 21 June 2019 17:59
Novo
Messages: 1371 Registered: December 2006
Ultimate Contributor
mirek wrote on Fri, 21 June 2019 03:16:
> Now what is different is that the standard GCC allocator has its threshold at 4MB, the U++ allocator at 224MB. In practice, this means that if you alloc/free a 5MB block in std, it gets released back to the system immediately. With U++, blocks up to 224MB are not returned to the system immediately. If they are really unused, this just means that the system will reclaim them when there is a need for more physical memory.
> Mirek
This info doesn't match what I'm reading in the docs I posted above:

    The lower limit for this parameter is 0. The upper limit is
    DEFAULT_MMAP_THRESHOLD_MAX: 512*1024 on 32-bit systems or
    4*1024*1024*sizeof(long) on 64-bit systems.

    Note: Nowadays, glibc uses a dynamic mmap threshold by
    default. The initial value of the threshold is 128*1024, but
    when blocks larger than the current threshold and less than or
    equal to DEFAULT_MMAP_THRESHOLD_MAX are freed, the threshold
    is adjusted upward to the size of the freed block. When
    dynamic mmap thresholding is in effect, the threshold for
    trimming the heap is also dynamically adjusted to be twice the
    dynamic mmap threshold. Dynamic adjustment of the mmap
    threshold is disabled if any of the M_TRIM_THRESHOLD,
    M_TOP_PAD, M_MMAP_THRESHOLD, or M_MMAP_MAX parameters is set.
So, the old allocator on 64-bit systems had a 32MB threshold (4*1024*1024*sizeof(long)).
In the new one it is determined dynamically. The initial value is 128KB.
I traced my app with a tool which intercepts all malloc/free calls, so I know exactly what is stressing the allocator.
Memory block sizes: [chart attached in the original post]
Regards,
Novo
Re: Core 2019 [message #51933 is a reply to message #51932]
Fri, 21 June 2019 18:10
Novo
Messages: 1371 Registered: December 2006
Ultimate Contributor
Unfortunately, I couldn't find a decent tool to track the real amount of system memory (mmaped) used by an app.
A chart similar to the one above would be very helpful; otherwise I can only compare the most notable values.
Regards,
Novo