
c-up

Member Since 29 Apr 2013
Offline Last Active Oct 11 2013 03:23 PM

#5074459 New professional programming language

Posted by c-up on 01 July 2013 - 09:59 AM

It doesn't use CUDA or DirectX, it's OpenCL and OpenGL. I've just installed Ubuntu on my PC at home so watch this space.




#5073851 New professional programming language

Posted by c-up on 29 June 2013 - 05:44 AM


Windows only!?!? You are a traitor to humanity.

 

Well, I don't like humans much, so good :P

 

Are you saying you'd use it if I did a Linux version? That can be arranged.




#5071996 New professional programming language

Posted by c-up on 22 June 2013 - 05:43 AM

Hi folks,

 

As promised version 0.3 has just been uploaded to the website with a full source code release.

 

This version has libraries for audio (OpenAL) and networking (based on sockets but wrapped in classes.)

 

It also has support for touch input (see the OpenGL sample) and improvements to the way generics are implemented.

 

Please let me know if you have any problems or other feedback.

 

Thanks.




#5062005 New professional programming language

Posted by c-up on 15 May 2013 - 04:04 AM

It just occurred to me reading this that when I talk about releasing source code, I just mean the C++ that comprises the compiler itself and the runtime systems (garbage collector, job manager, etc.)

 

All of the actual C-UP source code is already available, as it's embedded in the object files that are part of the download. It can be extracted into individual files using the 'extract' command as described in the user guide. This includes source for things like the rigid body library, the debugger itself, collections, parallel collections, string formatting, the OpenGL/OpenCL bindings, etc.

 

You can also just use the debugger to browse source code if you wish (and even view the generated assembly language interleaved with the source.)




#5061799 2d Platformer Gravity - constant or conditional!?

Posted by c-up on 14 May 2013 - 09:22 AM

Been a while since I wrote a platformer but the main reason not to apply gravity when on the ground is to prevent your character gradually sliding down slopes.

 

Whether you see this issue in the presence of slopes depends on how you respond to collisions, of course:

- if you just say he's hit a 'floor' and move him straight up to resolve the collision, then you won't see this problem

- if you move him in the direction of the normal of the surface, then he will slide unless you also implement a friction model

- however, a general friction model can be difficult to get feeling precise enough for a platformer

 

In my opinion the no-gravity option is easier to work with for getting very precise platforming mechanics, but the gravity-always-on version can give you nicer secondary effects, like little skids and bounces when you land a jump.

 

If all of your platforms are flat (i.e. no slopes) then there's not much in it.
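

To make the "no gravity when grounded" option concrete, here's a minimal sketch in C++ (the original post doesn't include code, and the flat floor plus the GroundHeightAt helper are just stand-ins for whatever collision query your engine provides). The point is that gravity is only integrated while airborne, and landing resolves the character straight up onto the surface, so a grounded character never accumulates the downward velocity that causes slope creep.

struct Character
{
    float x = 0.0f, y = 5.0f;    // position (starts in the air)
    float vx = 0.0f, vy = 0.0f;  // velocity
    bool grounded = false;
};

// stand-in for a real ground query: a flat floor at y = 0
static float GroundHeightAt(float /*x*/) { return 0.0f; }

static void Update(Character& c, float dt)
{
    const float gravity = -30.0f;

    if (!c.grounded)
        c.vy += gravity * dt;    // integrate gravity only while airborne
    else
        c.vy = 0.0f;             // grounded: no downward velocity, so nothing to slide

    c.x += c.vx * dt;
    c.y += c.vy * dt;

    float floorY = GroundHeightAt(c.x);
    if (c.y <= floorY)
    {
        c.y = floorY;            // resolve straight up onto the surface
        c.grounded = true;
    }
    else
    {
        c.grounded = false;      // jumped or walked off a ledge: gravity applies again
    }
}

With gravity always on you'd drop the grounded check, keep integrating every frame, and rely on collision response (and possibly friction) instead, which is where the little skids and bounces come from.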




#5060496 New professional programming language

Posted by c-up on 09 May 2013 - 01:57 AM

I keep mentioning shared memory. Here's how the scatter part of RadixSortP above might look if shared memory were used to parallelise it. Note the shared flag on the two output arrays of RadixScatter. All shared does is prevent the job manager from making a dependency between this shared memory and any other shared memory. That matters in this case because we're passing the entire itemsAlt and prefixes arrays to each call to RadixScatter, and without shared the calls would all depend on each other and be serialised.

 

I haven't compiled the code below but it should illustrate the concept.

 

// parallelism occurs at function scope, so we must introduce a new scatter function
parallel void RadixScatter<T>(shared T[] local itemsOut, shared int[] local prefixes, const(T)[] local itemsIn, int shift)
{
	foreach (const(T)& item; itemsIn)
	{
		// AtomicAdd returns the old value
		int o = AtomicAdd(&prefixes[(item.RadixSortKey << shift) >> 24], 1);
		itemsOut[o] = *item;
	}
}


// and here's the modified scatter part in RadixSortP
int split = items.Length / 4;	// arbitrarily split 4 ways
parallel
{
	int s = 0;
	for (int i = 0; i < 3; i++, s += split)
		RadixScatter<T>(itemsAlt, prefixes, items[s .. s + split], shift);
	// remainder
	RadixScatter<T>(itemsAlt, prefixes, items[s .. items.Length], shift);
}




#5060407 New professional programming language

Posted by c-up on 08 May 2013 - 03:55 PM

Hodgman mentioning radix sort got me wondering so I've knocked up an implementation.

 

On my 2 core laptop (AMD E-450 APU 1.65GHz), sorting 1024*1024 items takes:

1 core = 678ms

2 core = 368ms (1.84x speedup)

 

On my 4 core desktop (AMD Athlon II X4 630 2.8GHz), sorting the same number of items takes:

1 core = 333ms

2 core = 190ms (1.75x)

3 core = 136ms (2.45x)

4 core = 106ms (3.14x)

 

Pretty reasonable speedups.

 

Anyway, here's the radix sort code. It recursively subdivides the problem to generate the parallelism as described here:

http://software.intel.com/sites/default/files/m/b/3/6/3/8/22396-ParallelRadixSort_andreyryabov.pdf

 

It doesn't do parallel scatter as Hodgman suggested for reasons commented on in the code. I'd be interested to know if there's a better approach than I took.

 

As well as it being parallel internally you can also run multiple (non-overlapping) sorts in parallel if you need to.

 

// parallel radix sort
// in-place stable sort of items array into ascending key order
// type to be sorted must have RadixSortKey uint property
// you can pass in temporary workspace otherwise it will allocate one internally on the stack
public parallel void RadixSort<T>(T[] local items, T[] local itemsTemp = null)
{
	// workspace required for double buffering
	T[] local itemsAlt = itemsTemp;
	if (!itemsAlt)
		itemsAlt = new local T[items.Length];// = void;			// = void prevents zeroing the allocated memory (in theory - currently crashes compiler)

	if (items.Length > 4096)
		RadixSortP<T>(items, itemsAlt, 0);
	else
		RadixSortNP<T>(items, itemsAlt, 0);
}

private parallel void RadixSortP<T>(T[] local items, T[] local itemsAlt, int shift)
{
	// calculate prefix values
	int[] local prefixes = new local int[256];

	// if lots of keys then calculate prefixes in parallel
	if (items.Length >= 32768)
	{
		int split = items.Length / 4;
		int[] local [4] prefixesPar;
		for (int i = 0; i < 4; i++)
			prefixesPar[i] = new local int[256];
		parallel
		{
			int s = 0;
			for (int i = 0; i < 3; i++, s += split)
				RadixSortCountPrefixes<T>(prefixesPar[i], items[s .. s + split], shift);
			// remainder
			RadixSortCountPrefixes<T>(prefixesPar[3], items[s .. items.Length], shift);
		}

		// sum the prefixes using simd add (stack allocated arrays are at least 16 byte aligned)
		foreach (int[] local pfxPar; prefixesPar)
		{
			int8[] local pfxSrc8 = cast(int8[] local)pfxPar;
			int8[] local pfxDst8 = cast(int8[] local)prefixes;
			foreach (int8& pfx, i; pfxSrc8)
				pfxDst8[i] += *pfx;
		}
	}
	// if not a large number of prefixes do them in serial
	else
	{
		RadixSortCountPrefixes<T>(prefixes, items, shift);
	}

	// prefix is sum of all previous values
	int sum = 0;
	foreach (int& prefix, i; prefixes)
	{
		int newSum = sum + *prefix;
		*prefix = sum;
		sum = newSum;
	}

	// scatter values out to destination array
	// can be done in parallel if shared memory is used but:
	//	- you'd need to use atomic increments which would kill performance if you have lots of the same key
	//	- would make the sort unstable (items with equal keys could end up written out of their original order)
	// (maybe there's another way to partition the work I'm not seeing that avoids these issues)
	foreach (T& item; items)
	{
		int o = prefixes[(item.RadixSortKey << shift) >> 24]++;
		itemsAlt[o] = *item;
	}

	// done?
	if (shift < 24)
	{
		// now sort the sub-arrays in parallel if they're reasonably large or non-parallel if they're not
		parallel
		{
			int s = 0;
			for (int i = 0; i < 256; i++)
			{
				int e = prefixes[i];
		
				if ((e - s) > 4096)
					RadixSortP<T>(itemsAlt[s .. e], items[s .. e], shift + 8);
				else if ((e - s) > 1)
					RadixSortNP<T>(itemsAlt[s .. e], items[s .. e], shift + 8);
				else if ((e - s) > 0)
					items[s] = itemsAlt[s];

				s = e;
			}
		}
	}
}

// non-parallel version called by the master version above when arrays are sufficiently small
private void RadixSortNP<T>(T[] local items, T[] local itemsAlt, int shift) : parallel
{
	// calculate prefix values
	int[256] prefixes;

	RadixSortCountPrefixes<T>(&prefixes, items, shift);

	// prefix is sum of all previous values
	int sum = 0;
	foreach (int& prefix, i; prefixes)
	{
		int newSum = sum + *prefix;
		*prefix = sum;
		sum = newSum;
	}

	// scatter values out to destination array
	foreach (T& item; items)
	{
		int o = prefixes[(item.RadixSortKey << shift) >> 24]++;
		itemsAlt[o] = *item;
	}

	// if not done sort the sub-arrays
	if (shift < 24)
	{
		int s = 0;
		for (int i = 0; i < 256; i++)
		{
			int e = prefixes[i];

			if ((e - s) > 1)
				RadixSortNP<T>(itemsAlt[s .. e], items[s .. e], shift + 8);
			else if ((e - s) > 0)
				items[s] = itemsAlt[s];

			s = e;
		}
	}
}

// radix sort helper
parallel void RadixSortCountPrefixes<T>(int[] local prefixes, const(T)[] local items, int shift)
{
	foreach (const(T)& item; items)
		prefixes[(item.RadixSortKey << shift) >> 24]++;
}

 

And here's the little testbed I used to generate the results at the top:

 

import System.IO.Console;

uint randSeed = 0;

public void SeedRand(uint seed)
{
	randSeed = seed;
}

public uint Rand()
{
	randSeed = (randSeed * 1103515245 + 12345) & int.MaxValue;
	return randSeed;
}

public struct RadixItem
{
	private uint key;
	private int value;

	public this()
	{
		key = Rand();
		value = Rand();
	}

	public uint RadixSortKey() const : parallel
	{
		return key;
	}
}

public int Main(string[] args)
{
	SeedRand(12345);

	static const int numItems = 1024 * 1024;
	static const int numRuns = 10;

	RadixItem[] items = new RadixItem[numItems];
	RadixItem[] itemsTmp = new RadixItem[numItems];

	// uncomment this line to set number of threads job queue can use (it clamps internally to 1 .. number of physical cores on machine)
	//System.Profile.JobQueue.SetNumThreads(2);

	// run radix sort performance test
	Console.WriteLine("Timing radix sort of %d items", &numItems);
	ulong totalCycles = 0;

	// 10 runs
	for (int r = 0; r < numRuns; r++)
	{
		foreach (RadixItem& item; items)
			*item = RadixItem();

		ulong startCycle = GetProcessorCycleCount();

		RadixSort<RadixItem>(items, itemsTmp);

		ulong endCycle = GetProcessorCycleCount();
		totalCycles += endCycle - startCycle;

		// check sorted
		for (int i = 1; i < items.Length; i++)
			Assert(items[i].RadixSortKey >= items[i - 1].RadixSortKey);
	}

	// report average time taken
	float totalMs = cast(float)totalCycles / cast(float)GetProcessorCyclesPerSecond() / cast(float)numRuns * 1000.0f;
	Console.WriteLine("Average time taken over %d runs = %.2fms", &numRuns, &totalMs);

	return 0;
}



#5059435 New professional programming language

Posted by c-up on 05 May 2013 - 05:41 AM

I've fixed all of the issues you raised. The new version is on the website (I just overwrote the existing one.)

 

The crash was a divide by zero when the debugger was minimized (yes, it has been that well tested.)

 

Files will open in the debugger if they have one of these extensions: .cup .cl .frag .vert .glslv .glslf .glv .glf .txt

 

A couple of other things have changed:

- The optimiser works now (at least on all the tests I've run). It doesn't do that much optimisation yet, but some.

- Windows raw input is used instead of DirectInput so input.dll is no longer needed and you don't need to call Input.Init/Shutdown/Poll any more.

 

Thanks again,

Si




#5059125 New professional programming language

Posted by c-up on 04 May 2013 - 05:56 AM

You're not wrong about having my work cut out but I've been doing it for 4 years and have no intention of stopping now.




#5059107 New professional programming language

Posted by c-up on 04 May 2013 - 03:32 AM

So it seems the right thing to do is to release full source code, which I'll do from the next version to give me a chance to tidy things up a bit and separate out the platform-dependent stuff better.

 

Here's a roadmap of my plans for the rest of this year.

 

1. Simple optimiser (done)

2. Fully test SSE4 back end and half floats

3. Remove DirectInput - use Windows raw input instead (done)

4. Require explicit type on generic parameters so you can overload generics on more than just number of generic parameters

5. Audio library

6. Sockets library

 

- v0.3 release with source code (guessing mid June)

 

7. Vectors of longs (not all that useful but good preparation for 9)

8. Scatter/gather array access (see bottom of user guide for loose specification)

9. Vectors of pointers with scatter/gather/downcast

10. Shift vector by vector

 

- v0.4 release

 

11. Raspberry Pi implementation (Linux runtime, ARM backend)

 

- v0.5 release

 

12. x64 runtime and back-end

13. AVX/AVX2 support (buy new PC - find out if AVX2 gather is worth it!)

14. Improved optimiser

 

- v0.6 release




#5058866 New professional programming language

Posted by c-up on 03 May 2013 - 04:19 AM

Hi Hodgman. Yes, it can determine every byte that the parallel function calls can modify. This is because it only allows modification via what I call local references. Simply put, local references can only exist in local scopes, which fundamentally limits the scope of the data graph that can be accessed for writing (there's a much more detailed explanation in the docs.) That's not to say that the data you're modifying has to be local, but the references to it are tagged as local (e.g. "byte[] local" in the example above.)

 

The example doesn't show it, but you can also declare that a particular set of types and/or memory heaps (it supports multiple heaps) is to be treated as constant during a particular parallel function call. Any data that is one of these types, or can be statically proven to be on one of these heaps, can then be freely accessed in parallel code but not modified.

 

This system is perfectly robust to my knowledge and one of my greatest fears in life is that someone will find that it isn't.

 

Yes, the end of the parallel scope acts as a join point.

 

You can't interfere with this stuff using threads because I don't allow you to use threads. There's a big rant at the end of the user guide about why I think threads are a terrible idea, which might very well be a bone of contention. Anyway, you'd have to write your parallel radix sort using the parallel features of the language. There is support for shared memory, though, if you're an advanced user; it allows you to manage access to memory yourself in a way that bypasses the dependency checking. The parallel collections provided with the language use this feature themselves, and I guess you could do your sort example the same way.

 

 

Thanks for your points about swaying people to use the language - they are well taken. I know I wouldn't commit to someone else's language lightly. I'm going to have a serious think about the best way to get this into a releasable state. Hopefully I won't get hit by a bus in the meantime, but working in London you never know...

 

By the way, you mentioned interop with C which is easy because C-UP generates native code that can call C APIs natively. Please see the OpenGL and OpenCL bindings and sample provided if you need persuading :-)




#5057855 New professional programming language

Posted by c-up on 29 April 2013 - 01:50 PM

Hi game developers,

I've invented a new programming language based on over 20 years of experience writing games for a living and I'd like
to invite you all to give it a try.

The language is called C-UP, which is a play on the classic gaming terminology for lives/turns/power-ups: 1-up, 2-up, etc.

It's a C style imperative language so it should look very familiar to most of you but I hope it introduces enough new
concepts to make it worthwhile. I guess the headline features are safe automatic parallelism, full support for
SIMD vector types, proper arrays with slicing, multi-methods and generics, but there's lots of other good stuff.

Vector math is all SIMD based with intrinsic functions for things like dot, cross, min, max, sin, cos, etc.
There are libraries for collections, matrices, quaternions, window management, OpenGL, OpenCL, input, file system and
3d rigid body dynamics. Audio and networking are at the top of my list of libraries to implement next although you are
free to make your own.

There's also a debugger written in C-UP itself.

You can download it for free (Windows only) at: www.c-up.net.

Sorry for the basic website implementation but all of my time is being devoted to the language itself.

Unfortunately this means that there are no forums or anything yet but you can email me at the address given on the site
and of course discuss here. I'll try to answer any questions you have but it is just me and I do have a full time job in the
industry as well so please bear with me.

Thanks in advance for any time you're able to spend looking at this project - I hope you like it,

Si



P.S. Here's a little C-UP program to give you a taste of the language. This example draws some 2d coloured squares in parallel
using simd vector maths and 2d arrays with slicing.

 

// entry point
int Main(string[] args)
{
    // allocate a 2d dynamic array of 4 byte simd vector type
    // (a comprehensive range of simd types is supported)
    byte4[,] image = new byte4[64, 64];

    parallel
    {
        // fill the top-left quarter in red (array slicing with .. allows us to make a sub-array aliased over the master array)
        BlendRect(image[0..32, 0..32], byte4(255, 0, 0, 255));

        // fill the top-right quarter in green
        BlendRect(image[0..32, 32..64], byte4(0, 255, 0, 255));

        // fill the bottom-left quarter in blue
        BlendRect(image[32..64, 0..32], byte4(0, 0, 255, 255));

        // fill the bottom-right quarter in yellow
        BlendRect(image[32..64, 32..64], byte4(255, 255, 0, 255));

        // blend a 50% alpha white square over the middle
        BlendRect(image[16..48, 16..48], byte4(255, 255, 255, 128));

        // The calls to the parallel BlendRect function don't execute until the parallel block { } exits
        // and then they execute in parallel with automatic dependency checking.

        // In this example the first 4 squares will be filled in parallel and the final one will not happen until
        // they are all complete because it overlaps them all
    }

    return 0;
}

// fill an alpha blended rectangle
parallel void BlendRect(byte4[,] local rect, byte4 colour)
{
    // expand source colour to 16 bits per channel
    short4 srcColour = colour;

    // foreach on a 2d array gives each row in turn as a 1d array
    foreach (byte4[] row; rect)
    {
        // foreach on a 1d array yields each element in turn
        foreach (byte4& pixel; row)
        {
            // expand destination colour to 16 bits per channel
            short4 dstColour = *pixel;

            // perform blend - vector component swizzling and write masking is supported like glsl
            pixel.rgb = cast(byte4)(dstColour + ((srcColour - dstColour) * srcColour.aaaa) >> 8);
        }
    }
}


