
c-up

Posted 09 May 2013 - 01:55 PM

I keep mentioning shared memory. Here's how the scatter part of RadixSortP above might look if shared memory was used to parallelise it. Note the shared flag on the two output arrays of RadixScatter. All shared does is prevent the job manager from creating a dependency between this shared memory and any other shared memory, which matters in this case because we're passing the entire itemsAlt and prefixes arrays to each call to RadixScatter; without shared, the calls would all depend on each other and be serialised.

 

I haven't compiled the code below, but it should illustrate the concept.

 

// parallelism occurs at function scope, so we must introduce a new scatter function
parallel void RadixScatter<T>(shared T[] local itemsOut, shared int[] local prefixes, const(T)[] local itemsIn, int shift)
{
	foreach (const(T)& item; itemsIn)
	{
		// AtomicAdd returns the old value
		int o = AtomicAdd(&prefixes[(item.RadixSortKey << shift) >> 24], 1);
		itemsOut[o] = *item;
	}
}


// and here's the modified scatter part in RadixSortP
int split = items.Length / 4;	// arbitrarily split 4 ways
parallel
{
	int s = 0;
	for (int i = 0; i < 3; i++, s += split)
		RadixScatter<T>(itemsAlt, prefixes, items[s .. s + split], shift);
	// remainder
	RadixScatter<T>(itemsAlt, prefixes, items[s .. items.Length], shift);
}
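
For readers without the compiler for the language above, here is a rough standard-C++ sketch of the same idea: each worker thread scatters its own slice of the input, claiming output slots with an atomic fetch-and-add on the shared prefix array. The Item struct, its RadixSortKey field, and the ScatterParallel wrapper are stand-ins invented for illustration; this is a sketch of the technique, not a translation of the author's actual code.

#include <atomic>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

struct Item { uint32_t RadixSortKey; /* payload... */ };

// Scatter one slice of the input. `prefixes` holds the running output
// offset for each of the 256 buckets and is shared by all workers.
static void RadixScatter(Item* itemsOut, std::atomic<int>* prefixes,
                         const Item* itemsIn, size_t count, int shift)
{
    for (size_t i = 0; i < count; i++)
    {
        // fetch_add returns the old value, like AtomicAdd above
        int o = prefixes[(itemsIn[i].RadixSortKey << shift) >> 24].fetch_add(1);
        itemsOut[o] = itemsIn[i];
    }
}

// Split the scatter 4 ways, mirroring the parallel block above.
void ScatterParallel(std::vector<Item>& itemsAlt, std::atomic<int>* prefixes,
                     const std::vector<Item>& items, int shift)
{
    const size_t split = items.size() / 4;
    std::vector<std::thread> workers;
    size_t s = 0;
    for (int i = 0; i < 3; i++, s += split)
        workers.emplace_back(RadixScatter, itemsAlt.data(), prefixes,
                             items.data() + s, split, shift);
    // remainder, done on the calling thread
    RadixScatter(itemsAlt.data(), prefixes, items.data() + s,
                 items.size() - s, shift);
    for (auto& w : workers)
        w.join();
}

Here std::atomic's fetch_add plays the role of AtomicAdd, returning the value before the increment, and the prefixes array is the equivalent of the shared memory that every worker writes through; C++ has no job manager tracking dependencies, so the explicit join stands in for the implicit synchronisation at the end of the parallel block.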

