What does SSE2 usage enhance? [message #39622] Wed, 10 April 2013 13:31
crydev
Messages: 151
Registered: October 2012
Location: Netherlands
Experienced Member
A question came up for me: what exactly are the pros of using SSE2 as a flag in your build? What kinds of things can it speed up if the processor supports it?
Re: What does SSE2 usage enhance? [message #39658 is a reply to message #39622] Mon, 15 April 2013 18:34
mirek
Messages: 13975
Registered: November 2005
Ultimate Member
The SSE2 flag tells the compiler to use SSE2 for floating-point arithmetic in 32-bit x86 builds. (In 64-bit builds, it is always on.)

Cons: The executable will not work on CPUs without SSE2. SSE2 has been supported since the Pentium 4 and Athlon 64, so basically for more than 10 years now...

Mirek
Re: What does SSE2 usage enhance? [message #39682 is a reply to message #39658] Wed, 17 April 2013 10:54
crydev
Messages: 151
Registered: October 2012
Location: Netherlands
Experienced Member
Thanks, Mirek, for your reply. Does that mean an SSE-enabled version of functions such as memcpy() can be used in U++? Or can I assume that enabling SSE2 in the compiler flags automatically enables an SSE2-enabled version of memcpy? I think the Windows stock function is fairly slow.

p.s. My situation is copying blocks of 1~8 bytes very fast and very frequently in a loop. I have a chunk of memory which I loop through. Say my loop index is currently in the middle of my memory chunk; then it copies 4 bytes from the current index.
Re: What does SSE2 usage enhance? [message #39684 is a reply to message #39682] Wed, 17 April 2013 11:17
mirek
Messages: 13975
Registered: November 2005
Ultimate Member
The SSE2 flag is likely to have no impact on memcpy. If memcpy is emitted as a function call (not as an intrinsic), the library function is likely SSE2-optimized anyway. For an inlined intrinsic, SSE2 code is way too complicated.

The main difference is in code using floating-point arithmetic: without SSE2, it uses the x87 FP stack; with SSE2, it uses the XMM register file, which is potentially faster.

Mirek
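
To illustrate the x87 versus XMM point above, here is a minimal sketch. The names Dot and kBuiltWithSse2 are hypothetical and not part of U++. The predefined macros are the ones documented for GCC/Clang (__SSE2__) and MSVC (_M_IX86_FP, _M_X64). In a 32-bit build without the SSE2 flag, a compiler would typically emit x87 instructions for the loop body; with the flag it is free to use scalar XMM instructions instead. Whether that is measurably faster depends on the code and the CPU.

```cpp
#include <cassert>
#include <cstddef>

// True when the compiler was allowed to generate SSE2 code for this build.
// __SSE2__ is predefined by GCC/Clang; _M_IX86_FP >= 2 is the 32-bit MSVC
// equivalent; _M_X64 covers 64-bit MSVC, where SSE2 is always available.
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
constexpr bool kBuiltWithSse2 = true;
#else
constexpr bool kBuiltWithSse2 = false;
#endif

// Plain floating-point loop (hypothetical example). The source is identical
// in both build modes; only the generated instructions differ.
float Dot(const float* a, const float* b, std::size_t n)
{
    assert(n == 0 || (a != nullptr && b != nullptr));
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

Comparing the disassembly of Dot in a 32-bit build with and without the flag is probably the quickest way to see the difference; kBuiltWithSse2 only reports which mode the build was compiled in.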
Re: What does SSE2 usage enhance? [message #39685 is a reply to message #39682] Wed, 17 April 2013 11:20
mirek
Messages: 13975
Registered: November 2005
Ultimate Member
crydev wrote on Wed, 17 April 2013 04:54


p.s. My situation is copying blocks of 1~8 bytes very fast and very frequently in a loop. I have a chunk of memory which I loop through. Say my loop index is currently in the middle of my memory chunk; then it copies 4 bytes from the current index.


BTW, you might have a look at SVO_MEMCPY.
Re: What does SSE2 usage enhance? [message #39691 is a reply to message #39685] Wed, 17 April 2013 22:04
crydev
Messages: 151
Registered: October 2012
Location: Netherlands
Experienced Member
Thanks, Mirek, for your suggestion. I tried it, but it is actually a lot slower than the conventional memcpy. I stepped through the ASM, though, and it seems that the default memcpy is already partially optimized for SSE2.
Re: What does SSE2 usage enhance? [message #39694 is a reply to message #39691] Thu, 18 April 2013 06:34
mirek
Messages: 13975
Registered: November 2005
Ultimate Member
Are you really copying just 1-8 bytes? For such small amounts and unaligned data, SSE2 is IMO meaningless.

Mirek
Re: What does SSE2 usage enhance? [message #39696 is a reply to message #39694] Thu, 18 April 2013 08:33
crydev
Messages: 151
Registered: October 2012
Location: Netherlands
Experienced Member
Yes, it is only about small blocks of 1~8 bytes. I'll stick with memcpy. It is good enough. Thanks :)
Re: What does SSE2 usage enhance? [message #39700 is a reply to message #39696] Thu, 18 April 2013 10:38
mirek
Messages: 13975
Registered: November 2005
Ultimate Member
Well, I am asking because in that case SVO_MEMCPY has a problem (it is supposed to be an optimization over memcpy for small blocks).

Can you show me a small code snippet?

Mirek
Re: What does SSE2 usage enhance? [message #39702 is a reply to message #39700] Thu, 18 April 2013 11:43
crydev
Messages: 151
Registered: October 2012
Location: Netherlands
Experienced Member
Here is a small code snippet. If you see anything that is strange or wrong, I would appreciate feedback. :)

The memcpy line copies, in this case, 4 bytes starting at index i of the array buffer into the float variable tempStore.

template<>
void MemoryScanner::ScanWorker(const MemoryRegion& region, const float& value)
{
	Byte *buffer = (Byte*)MemoryAlloc(region.MemorySize);
	if (!ReadProcessMemory(this->mOpenedProcessHandle, (void*)region.BaseAddress, buffer
		, region.MemorySize, NULL))
	{
		MemoryFree(buffer);
		return;
	}
	
	Vector<MemoryBlockBase*> localResults;
	
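	// Stop sizeof(float) bytes before the end so memcpy never reads past the buffer.
	for (int i = 0; i + sizeof(float) <= region.MemorySize; i++)
	{
		float tempStore;
		// Unaligned 4-byte read at offset i.
		memcpy(&tempStore, &(buffer[i]), sizeof(float));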
		
		if (TemplateCompare(tempStore, value)) // WRITES TO FREED BLOCKS DETECTED
		{
			MemoryBlock<float>* mb = new MemoryBlock<float>();
			mb->Address = static_cast<unsigned int>(region.BaseAddress + i);
			mb->Size = sizeof(float);
			mb->Buffer = tempStore;
			localResults.Add(mb);
		}
	}

	MemoryFree(buffer);

	AtomicInc(this->ThreadFinishCount);

	if (localResults.GetCount() > 0)
	{
		this->AddThreadSpecificSearchResults(localResults);
	}

	this->UpdateScanningProgress(this->ThreadFinishCount);
}
Re: What does SSE2 usage enhance? [message #39703 is a reply to message #39702] Thu, 18 April 2013 11:58
mirek
Messages: 13975
Registered: November 2005
Ultimate Member
I see. I have checked the assembly, and memcpy gets inlined as an unaligned load, which is still faster than loading the 4 bytes separately, as SVO_MOVE does.

So the actual code for this memcpy is:

00401744  mov ecx,[eax] 
00401746  mov [esp+0x4],ecx


(SVO_MOVE is designed for small blocks of variable size.)

Mirek
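
To make the comparison above concrete, here is a minimal sketch of the two copy strategies. LoadFloat and CopySmall are hypothetical helpers written for this illustration. CopySmall is not the actual SVO_MEMCPY or SVO_MOVE code from U++; it is only a stand-in for "copy a small, variable number of bytes one at a time". When the size is a compile-time constant, as in LoadFloat, optimizing compilers typically reduce the memcpy to a single 4-byte load like the one in the disassembly above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Fixed-size copy: reads a float from an arbitrary, possibly unaligned,
// byte offset. The constant size lets the compiler inline the memcpy.
float LoadFloat(const unsigned char* p)
{
    float value;
    std::memcpy(&value, p, sizeof value);
    return value;
}

// Variable-size copy for very small blocks, one byte at a time
// (hypothetical stand-in, limited to 8 bytes as in the use case above).
void CopySmall(void* dst, const void* src, std::size_t size)
{
    assert(size <= 8);
    unsigned char* d = static_cast<unsigned char*>(dst);
    const unsigned char* s = static_cast<const unsigned char*>(src);
    while (size--)
        *d++ = *s++;
}
```

Both helpers produce the same bytes. The difference is only in how many loads and stores end up in the generated code, which is why the fixed-size memcpy can win when the block size is known at compile time.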