
# Memory fences, speed and necessity.


## Recommended Posts

I am trying to pass a list of pointers from one thread to another as fast as I possibly can. The solution I came up with is basically a ring buffer: it uses arrays of pointers in a linked list. When one "page" fills up, it moves to the next one in the ring, adding a page if the next page still has unread data. My exact implementation can be found below. It's written in asm, but the C++ equivalent is also included. For testing I made a simple program that passed a series of integers instead of pointers from one thread to another. I noticed a major performance degradation when running it on a hyper-threading processor, and an even greater degradation on a multicore system. My initial results for 100,000,000 integers were:
Quote:
 Athlon XP 1800+: 3.2sec, Pentium 4 (hyper-threading): 9.3sec, Pentium D (multicore): 12.1sec
I found the real killer was the use of memory fences. After removing the fences my results for 100,000,000 integers were.
Quote:
 Athlon XP 1800+: 2.3sec, Pentium 4 (hyper-threading): .9sec, Pentium D (multicore): .7sec
From 12.1sec to .7sec is a significant difference. I realize that my game will never need to pass 100,000,000 pointers from one thread to another in less than a second, but in the interest of optimization, I started looking for alternatives to memory fences. After reading the Intel and AMD developer's manuals, I started to question whether they are really even needed. I need a memory fence in two places, basically: when I add a pointer, I need to make sure the pointer is written before I move the write pointer, which means I need a store fence to enforce write ordering; and when I read a pointer, I have to make sure the pointer is read before I move the read pointer, which means I need a full memory fence to make sure the write does not precede the read. From the Intel developer's manual:
Quote:
 Writes are not reordered with older reads. ... Writes to memory are not reordered with other writes, with the exception of writes executed with the CLFLUSH instruction and streaming stores (writes) executed with the non-temporal move instructions (MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD).
From the AMD developer's manual:
Quote:
 Generally, out-of-order writes are not allowed. Write instructions executed out-of-order cannot commit (write) their result to memory until all previous instructions have completed in program order. The processor can, however, hold the result of an out-of-order write instruction in a private buffer (not visible to software) until that result can be committed to memory.
This seems contrary to what I have read in many places on the internet before, including these forums. One common example is data being filled into an array: supposedly array[1] and array[0] can be written by the processor in either order. The compiler may reorder writes, but according to the information above, writes cannot be reordered by the processor under normal circumstances. In addition, the processor will not commit a write before a read that comes earlier in program order. Am I missing something here, or do I really not need the memory fences?
#define SZ_PAGE		1000

struct SafeListPage
{
	unsigned write;
	SafeListPage* next;
	unsigned data[SZ_PAGE];
};

class CSafeListPage
{
public:
	CSafeListPage();
	~CSafeListPage();
	void __fastcall Write(unsigned data);
private:
	SafeListPage* m_write;
};

CSafeListPage::CSafeListPage()
{
	SafeListPage* page;
	page=new SafeListPage;

	page->write=0;
	page->next=page;
	m_write=page;
}

CSafeListPage::~CSafeListPage()
{
	SafeListPage* start;
	SafeListPage* cur;
	SafeListPage* next;

	start=m_write;
	cur=start;
	do
	{
		next=cur->next;
		delete cur;
		cur=next;
	}
	while(cur!=start);
}

__declspec(naked) void __fastcall CSafeListPage::Write(unsigned data)
{
/*
SafeListPage* page;
unsigned write;

write=m_write->write;		//get current write pos on page
m_write->data[write]=data;	//write data
if(++write<SZ_PAGE)		//if page not full
{
	__asm{sfence};		//make sure data is written before we inc the write pos
	m_write->write=write;
	return;
}
page=m_write->next;
if(next page has unread data)
{
	page=new SafeListPage;	//insert a page between them
	page->next=m_write->next;
	m_write->next=page;
}
page->write=0;			//init page
__asm{sfence};			//make sure data and page init written before write pos
m_write->write=write;
m_write=page;
*/

__asm
{
push esi		//save register
//ecx=this
//edx=data
mov eax,[ecx]		//eax=m_write
mov esi,[eax]		//esi=m_write->write
mov [eax+12+esi*4],edx	//m_write->data[m_write->write]=data
add esi,1		//esi+=1
cmp esi,SZ_PAGE
jge next_page
//sfence
mov [eax],esi		//m_write->write=esi
pop esi			//restore register
ret

next_page:
push ebx		//save register
mov ebx,[eax+8]		//ebx=m_write->next
cmp ebx,edx
jne set_page
mov ebx,ecx		//ebx=this
push SIZE SafeListPage
call new
mov ecx,[ebx]		//ecx=m_write
mov edx,[ecx+8]		//edx=m_write->next
mov [eax+8],edx		//new->next=m_write->next
mov [ecx+8],eax		//m_write->next=new
mov dword ptr [eax],0	//new->write=0
//sfence
mov [ecx],esi		//m_write->write=esi
mov [ebx],eax		//m_write=new
pop ebx			//restore registers
pop esi
ret

set_page:
mov dword ptr [ebx],0	//m_write->next->write=0
//sfence
mov [eax],esi		//m_write->write=esi
mov [ecx],ebx		//m_write=ebx
pop ebx			//restore registers
pop esi
ret
};
}

__declspec(naked) unsigned __fastcall CSafeListPage::Read()
{
/*
unsigned data;

if(read pos == write pos)	//nothing to read
	return 0;
...				//read data, advance read pos
if(read pos < SZ_PAGE)		//not at the end of the page
	return data;
__asm{mfence};			//make sure data read before we leave the page
return data;			//then move to the next page
*/

__asm
{
push ebx			//save register
//ecx=this
cmp edx,eax
jne get_data
xor eax,eax			//return 0
pop ebx				//restore register
ret

get_data:
cmp edx,SZ_PAGE			//if end of page
jge next_page
pop ebx				//restore register
ret

next_page:
//mfence
pop ebx				//restore register
ret
};
}
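The `//sfence` and `//mfence` slots in the code above correspond, in later C++ (C++11 atomics postdate this thread), to release and acquire fences. A minimal sketch of the same publish/consume ordering for a single slot; the names `payload`, `ready`, `publish`, and `consume` are invented for illustration:

```cpp
#include <atomic>

// Hypothetical one-slot handoff; names are invented, not from the post above.
static unsigned payload = 0;
static std::atomic<unsigned> ready(0);

// Writer side: the data store must become visible before the flag store
// (the //sfence slot in the code above).
void publish(unsigned v)
{
    payload = v;                                          // write the data
    std::atomic_thread_fence(std::memory_order_release);  // portable stand-in for sfence
    ready.store(1, std::memory_order_relaxed);            // then move the "write pos"
}

// Reader side: the flag load must complete before the data load
// (the //mfence slot in the code above).
unsigned consume()
{
    if (ready.load(std::memory_order_relaxed) == 0)
        return 0;                                         // nothing to read
    std::atomic_thread_fence(std::memory_order_acquire);  // keeps the data load after the flag load
    return payload;
}
```

On x86 both of these fences compile to no instruction at all, but they still stop the compiler from reordering the two stores (or the two loads), which raw asm cannot rely on once the surrounding code is in C++.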



##### Share on other sites
I'm not sure I see why you need fences at all. In the single-reader/single-writer case you don't need any synchronization. In the Write() case what are you protecting against?
mov esi,[eax]		//esi=m_write->write
mov [eax+12+esi*4],edx	//m_write->data[m_write->write]=data
add esi,1		//esi+=1
cmp esi,SZ_PAGE
jge next_page
//sfence
mov [eax],esi		//m_write->write=esi

You want to make sure mov esi,[eax] and mov [eax],esi don't conflict in case the latter doesn't complete by the time you execute the former on the next call? You don't need a fence for that.

##### Share on other sites
Quote:
 Original post by wprintf: From 12.1sec to .7sec is a significant difference. I realize that my game will never need to pass 100,000,000 pointers from one thread to another in less than a second. But in the interest of optimization, I started looking for alternatives to memory fences.

Optimization is never of interest; performance is of interest, which is affected by optimization. Synthetic benchmarks mean nothing because they are often wrong. Have you tested an actual use scenario? Typically you don't need memory fences unless you have multiple readers/writers.

##### Share on other sites
Quote:
 Original post by outRider: I'm not sure I see why you need fences at all. In the single-reader/single-writer case you don't need any synchronization. In the Write() case what are you protecting against?

I'm trying to make sure the data is written before the new write position is stored. Once the write position is moved, the data can be read by Read(). To be more specific.
mov [eax+12+esi*4],edx	//this must happen
add esi,1
cmp esi,SZ_PAGE
jge next_page
//sfence
mov [eax],esi			//before this

Quote:
 Original post by outRider: Optimization is never of interest; performance is of interest, which is affected by optimization. Synthetic benchmarks mean nothing because they are often wrong. Have you tested an actual use scenario? Typically you don't need memory fences unless you have multiple readers/writers.

I realize that just because one function in my game is 17 times faster does not mean it will have a noticeable effect on performance. In this case I know it won't, because Read()/Write() will be called less than twice per frame on average. But I know that it will not be a detriment to performance, unless something is very wrong in the code somewhere.

Just to reiterate, my question is: are memory fences really needed to enforce read/write ordering? The AMD and Intel manuals suggest that, unless CLFLUSH or non-temporal move instructions are used, the processor can only reorder reads. Or am I missing something?

##### Share on other sites
First, I really think you mean to say that you want mov esi,[eax] to happen before mov [eax],esi because there is a data dependency there. There is no dependency between mov [eax+12+esi*4],edx and mov [eax],esi, the latter does not change any of the registers the former uses to form its address, nor do they access the same memory location.

Anyway, you can be sure that writes to [eax] will be seen by all subsequent reads from the same [eax] on the same processor, because out-of-order execution wouldn't work otherwise; they may issue out of order, but they won't execute out of order, because the load will snoop the store queue to see if any stores to that address are currently pending and will wait if so. Use the fence instructions when you want writes to be globally visible at a certain point in time, i.e. visible to other processors, devices on the peripheral buses, etc., after an sfence instruction completes. Edit: Or for those SSE moves that don't observe ordering, I guess; I have no experience with them.
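The same-processor guarantee described above can be shown in a trivial single-thread sketch; `slot` and `store_then_load` are invented names for illustration:

```cpp
// A load always observes a program-order-earlier store to the same address
// on the same processor: the load forwards from the pending entry in the
// store queue. No fence is involved.
static unsigned slot = 0;

unsigned store_then_load(unsigned v)
{
    slot = v;     // may still be sitting in the store buffer...
    return slot;  // ...but this load snoops it and returns the new value
}
```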

##### Share on other sites
I guess I really haven't explained myself very well. As thread A is Write()'ing pointers to the list it increments the write position in an array. When the write position reaches the end of the array, it moves to the next array in the linked list. As thread B is Read()'ing pointers from the list it compares the read position with the write position in the array currently being read. If they match, it returns zero because there is nothing to be read, otherwise it returns the next pointer in the array and increments the read position. After it has read the last element of the array it moves to the next array in the linked list.

So Read() is loading two DWORDs that are written by Write(). First it reads the write position in the array, then if there is any unread data in the array it will return it. The pointer written to the array must be globally visible before the write position is globally visible. If the processor writes the write position before the pointer, it is possible for Read() to get a bad pointer.
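The protocol described in the two paragraphs above can be sketched, under invented names and a fixed power-of-two capacity, as a minimal single-producer/single-consumer queue in C++11 atomics (which postdate the thread): the writer stores the element and then publishes the write position; the reader loads the write position and then reads the element.

```cpp
#include <atomic>
#include <cstddef>

// Sketch only -- SpscRing, kSize, and the member names are assumptions,
// not the poster's actual implementation.
struct SpscRing
{
    static const std::size_t kSize = 1024;            // must be a power of two
    unsigned data[kSize];
    std::atomic<std::size_t> writePos{0};
    std::atomic<std::size_t> readPos{0};

    // Called only from the writer thread; returns false if full.
    bool write(unsigned value)
    {
        std::size_t w = writePos.load(std::memory_order_relaxed);
        std::size_t r = readPos.load(std::memory_order_acquire);
        if (w - r == kSize)
            return false;                             // full
        data[w % kSize] = value;                      // 1) the element...
        writePos.store(w + 1, std::memory_order_release); // 2) ...then the position
        return true;
    }

    // Called only from the reader thread; returns false if empty.
    bool read(unsigned* out)
    {
        std::size_t r = readPos.load(std::memory_order_relaxed);
        std::size_t w = writePos.load(std::memory_order_acquire);
        if (r == w)
            return false;                             // nothing to read
        *out = data[r % kSize];                       // 1) the element...
        readPos.store(r + 1, std::memory_order_release); // 2) ...then the position
        return true;
    }
};
```

The release store on `writePos` is exactly the "pointer must be globally visible before the write position" requirement: a reader that observes the new position is guaranteed to observe the element stored before it.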

Quote:
 Use the fence instructions when you want writes to be globally visible at a certain point in time, i.e. to other processors, devices on the peripheral buses, etc after an sfence instruction completes.

That is my question. Is it really necessary to use sfence under normal circumstances to force writes to be globally visible (even to other processors) in program order? Doesn't the processor already ensure that they will be globally visible in program order?

##### Share on other sites
OK I gotcha.

You're safe. For normal x86 store instructions there are ordering guarantees: reordering stores after stores is not allowed, whereas PowerPC, Itanium and others allow it. You won't see a new value at B before you see a new value at A. You may see them at the same time (due to write combining), and you may not see them for a while (due to store queuing), but you should see them in order from a software POV. I think sfence and friends are there for SSE non-temporal loads/stores that may commit out of order.
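One caveat worth sketching (in C++11 atomics, which postdate this thread): the one reordering x86 does permit is a later load passing an earlier store to a different address. That never affects this single-writer/single-reader queue, but it is exactly the case a full fence exists for, e.g. Dekker-style mutual exclusion, where each thread stores its own flag and then loads the other thread's flag. The names here are invented for illustration:

```cpp
#include <atomic>

// Store-my-flag-then-load-the-other-flag pattern, where x86's StoreLoad
// reordering does bite. seq_cst makes the compiler emit the mfence (or
// a locked instruction); release/acquire alone would not.
std::atomic<int> flag0(0), flag1(0);

// Thread 0 would run this; thread 1 runs the mirror image with the flags
// swapped. Both may be refused entry, but never both admitted.
bool thread0_may_enter()
{
    flag0.store(1, std::memory_order_seq_cst);          // store my flag...
    return flag1.load(std::memory_order_seq_cst) == 0;  // ...then load the other
}
```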