
Not dead...

To The Metal.

Posted in low level, scripting · 12 June 2011 · 371 views

I'm trying to recall when I first got into programming; it was probably somewhere in the window of 5 to 7. We had a BBC Micro at home thanks to my dad's own love of technology, and I vaguely recall writing silly BASIC programs at that age; certainly my mum has told me stories about how, at 5, I was loading up games via the old tape drive, something she couldn't do *chuckles*. But it was probably around 11 that I really got into it, after meeting someone at high school who knew BASIC on the Z80, and from there my journey really began.

(In the next 10 years I would go on to exceed both my dad and my friend when it came to code writing ability, so much so that the control program for my friend's BSc in electronics was written by me in the space of a few hours when he couldn't do it in the space of a few months ;) )

One of the best times I had during this period was when I got an Atari STe and moved from STOS Basic to 68K assembly so that I could write extensions for STOS to speed operations up. I still haven't had a coding moment which has given me as much joy as the two days I spent crowbarring a 50KHz MOD replay routine into an extension; I lack words for the joy when it finally came to life and played back the song without crashing :D You think you've got it hard these days with compile-link-run-debug cycles? Try doing one of them on an 8MHz computer with only floppy drives to load from and only being able to run one program at a time ;)

The point of all this rambling is that one of the things I miss on modern systems, with our great big optimising compilers, is the 'to the metal' development which was the norm back then; I enjoyed working at that level.

In my last entry I was kicking around the idea of making a new scripting language which would natively know about certain threading constraints (only read remote, read/write private local) and I was planning to backend it onto LLVM for speed reasons.

During the course of my research into this idea I came across a newsgroup posting by the author of LuaJIT explaining why LuaJIT is so fast. The long and the short of it is this:

A modern C or C++ compiler suite is a mass of competing heuristics which are tuned towards the 'common' case, and for that purpose they generally work 'well enough' for most code. However, an interpreter ISN'T most code; it has very particular code which the compiler doesn't deal with well.

LuaJIT gets its speed from the main loop being hand written in assembler which allows the code to do clever things that a C or C++ compiler wouldn't be able to do (such as decide what variables are important enough to keep in registers even if the code logic says otherwise).

And that's when a little light went on in my head and I thought: hey, you know what, that sounds like fun! A low level, to the metal, style of programming with a decent reason for doing it (aka the compiler sucks at making this sort of code fast).

At some point in the plan I decided that I was going to do x64 support only. The reasons for this are twofold:

1) It makes the code easier to write. You can make assumptions about instructions, and you don't have to deal with any crazy calling conventions, as the x64 calling convention is fixed and pretty sane all things considered.

2) x86 is a slowly dying breed and frankly I have no desire to support it and contort what could be some nice code into a horrible mess to get around its lack of registers and crazy calling conventions.

I've spent the better part of today going over a lot of x86/x64 material, and I now know more about x86/x64 instruction decoding and x64 function calling/call stack setup than any sane person should... however, it's been an interesting day :)

In fact x64 has some nice features which would aid the speed of development, such as callers setting up the stack for callees (so tail calls become easy to do) and passing things around in registers instead of via the stack. Granted, while inside the VM I can always do things 'my way' to keep values around as needed, but it's worth following the x64 convention to make interop that much easier.

The aim is to get a fully functional language out of this, one which can interop with C and C++ functions (calling non-virtual member functions might be the limit here) and have some 'safe' threading functionality as outlined in the previous entry.

Granted, having not written a single line of code yet, that is some way off to say the least :D So, for now, my first aim is to get a decoding loop running which can execute the 4 'core' operations of any language:

- move
- add
- compare
- jump

After that I'll see about adding more functionality; the key thing here is designing the ISA in such a way that extending it won't horribly mess up decode and dispatch times.

Oh, and for added fun, as the MSVC x64 compiler doesn't allow inline assembly, large parts are going to be fully hand coded... I like this idea ^_^

Kicking about an idea...

Posted in scripting · 26 May 2011 · 272 views

Scripting languages, such as Lua and Python, are great.

They allow you to bind with your game and quickly iterate on ideas without the recompile-link step you'd have with something like C++ in the mix.

However, in a highly parallel world those languages start to look lacking, as they often have a 'global' state which makes it hard to write code in the language in question which can execute across multiple threads (I'm aware of Stackless Python, though I admit I've not looked at it closely), certainly when data is being updated.

This got me thinking: going forward, a likely common pattern in games to avoid locks is to have a 'private' and a 'public' state for objects, which allows loops which look like this:

[update] -> [sync] -> [render]

or even

[update] -> [render] -> [sync]

Either way, that 'sync' step can be used, in a parallel manner, to make the 'private' state publicly visible so that during the 'update' phase other objects can query and work with it.

Of course, to do this effectively you'd have to store two variables, one for the private and one for the public state, and deal with moving data between them, which is something you don't really want to be doing by hand.

This got me thinking: what if you could 'tag' elements as 'syncable' in some way and have the scripting back end take care of the business of state copying and, more importantly, of tracking which copy was active in a given context? Then, when you ran your code, the runtime would figure out, based on context, which copy of the state it had to access.

There would still need to be a 'sync' step called in order to have the runtime copy the private data to the public side. This would have to be user callable, as it would be hard for the runtime to know when it was 'safe' to do so, but it would remove a lot of the problem: you would only declare your variables once and your functions once, and the back end would figure it out. (You could even use a system like Lua's table keys, where you can make them 'weak' by setting a meta value, so values could be added to structures at runtime.) The sync step could also use a copy-on-write setup so that if you don't change a value it doesn't try to sync it.

It needs some work, as ideas go, to make it viable but I thought I'd throw the rough idea out for some feedback, see if anyone has any thoughts on it all.

On APIs.

Posted in NV, OpenGL, OpenCL, DX11, AMD · 23 March 2011 · 873 views

Right now 3D APIs are a little... depressing... on the desktop.

While I still think D3D11 is technically the best API we have on Windows, the fact that AMD and NV currently haven't implemented multi-threaded rendering in a manner which helps performance is annoying. I've heard that there are good technical reasons why this is a pain to do; I've also heard that right now AMD have basically sacked it off in favour of focusing on the Fusion products. NV are a bit further along, but in order to make use of it you effectively give up a core, as the driver creates a thread which does the processing.

At this point my gaze turned to OpenGL. With OpenGL 4.x, while the problems with the API are still there in the bind-to-edit model, which is showing no signs of dying, feature-wise it has to a large degree caught up. Right now, however, there are a few things I can't see a way of doing from GL; if anyone knows differently, please let me know...

  • Thread-free resource creation. The D3D device is thread safe in that you can call its resource creation routines from any thread. As far as I know GL still needs to use a context which must be bound to the 'current' thread to create resources.
  • Running a pixel shader at 'sample' frequency instead of pixel frequency. So, in an MSAA x4 render target we would run 4 times per pixel.
  • The ability to write to a structured memory buffer in the pixel shader. I admit I've not looked too closely at this, but a quick look at the latest extension for pixel/fragment shaders doesn't give any clues that this can be done.
  • Conservative depth output. In D3D a shader can be tagged in such a way that it'll never output depth greater than the fragment was already at, which preserves early-z rejection while allowing you to write out depth info different to that of the primitive being drawn.
  • Forcing early-z to run; when combined with the UAV writing above this allows things like calculating both colour and 'other' information per-fragment and only having both written if early-z passes. Otherwise UAV data is written even when colour isn't.
  • Append/consume structured buffers; I've not spotted anything like this anywhere. I know we are verging into compute territory here, which is OpenCL, but pixel shaders can use them.

There are probably a few others which I've missed; however, these spring to mind and many of them I want to use.

OpenGL also still has the 'extension' burden around its neck, with GLee out of date and GLEW just not looking that friendly (I took a look at both this weekend gone). In a way I'd like to use OpenGL because it works nicely with OpenCL, and in some ways the OpenCL programming model is nicer than the DirectCompute model, but with apparently missing API/hardware features this isn't really workable.

In recent weeks there has been talk of ISVs wanting the 'API to go away' because (among other things) it costs so much more to make a draw call on the PC vs consoles. While I somewhat agree with the desire to free things up and get at the hardware more, one of the reasons put forward for this added 'freedom' was to stop games looking the same. However, in a world without APIs, where you are targeting a constantly moving set of goal posts, you'll see more companies either drop the PC as a platform or license an engine to do all that for them.

While people talk about 'to the metal' programming being a good idea because of how well it works on the consoles, they seem to forget it often takes half a console life cycle for this stuff to become commonplace, and that is targeting fixed hardware. In the PC space things change too fast for this sort of thing; AMD alone, in one cycle, would have invalidated a lot of work by going from VLIW5 to VLIW4 between the HD5 and HD6 series, never mind the underlying changes to the hardware itself. Add in the fact that 'to the metal' support would likely lag hardware releases and you don't have a compelling reason to go that route, unless all the IHVs decide to go with the same TTM "API", at which point things will get... interesting (see OpenGL for an example of what happens when IHVs try to get along).

So, unless NV and AMD want to slow down hardware development so things stay stable for multiple years, I don't see this as viable at all.

The thing is, SOMETHING needs to be done about the widening 'draw call gap' between consoles and PCs. Right now 5 year old hardware can outperform a cutting edge system when it comes to the CPU cost of draw calls; fast forward 3 years to the next generation of console hardware, which is likely to have even more cores than now (12 minimum, I'd guess), faster RAM and DX11+ class GPUs as standard. Unless something goes VERY wrong, this hardware will likely allow trivial application of command lists/multi-threaded rendering, further opening the gap between the PC and consoles.

Right now PCs are good 'halo' products, as they allow devs to push up the graphics quality settings and simply soak up the fact we are CPU limited on graphics submission thanks to out-of-order processors, large caches and higher clock speeds. But clock speeds have hit a wall, and when the next generation of consoles drops they will match PCs on single threaded clock speed and graphics hardware... suddenly the pain of developing on a PC, with its flexible hardware, starts to look less and less attractive.

For years people have been predicting the 'death of PC gaming', and the next generation of hardware could well cause, if not that, then the reduction of the PC to MMO, RTS, TBS and 'Facebook' games, while all the large AAA games move off to the consoles where development is easier, rewards are greater and things can be pushed further.

We don't need the API to 'go away', but it needs to become thinner, both on the client AND the driver side. MS and the IHVs need to work together to make this a reality, because if not they will all start to suffer in the PC space. Of course, with the 'rise of mobile' they might not even consider this an issue..

So, all in all, the state of things is depressing... too much overhead, missing features, and in some ways doomed in the near future...

Basic Lua tokenising is go...

Posted in Lua · 16 January 2011 · 359 views

Over a few weekends leading up to Xmas, and the last couple since then, I have been playing around with boost::spirit, and taking a quick look at ANTLR, in order to set up some code to parse Lua and generate an AST.

Spirit looked promising; the ability to pretty much dump the Lua BNF into it was nice, right up until I ran into left recursion and ended up in stack overflow land. I then went hunting for an existing example, but the one I found failed to compile, used all manner of boost::fusion magic and was generally a pain to work with.

I had a look at ANTLR last weekend, and while dumping out a C++ parser using their GUI tool was easy enough, the C++ docs are... lacking... and I couldn't make any headway when it came to using it.

This afternoon I decided to bite the bullet and just start doing it 'by hand'. Fortunately the Lua BNF isn't that complicated, with a low number of keywords to deal with and a syntax from which it shouldn't be too hard to build a sane AST out of a token stream.

I'm not doing things completely by hand; the token extraction is being handled by boost::tokenizer with a custom written skipper which dumps spaces and semi-colons, keeps the rest of the punctuation required by Lua and, importantly, is aware of floating point/double numbers so that it can correctly spit out a dot as a token only when it makes sense.

Currently it doesn't deal with (and hasn't been tested with) octal or escaped characters, and comments would probably cause things to explode; however, I'll deal with those in the skipper at some point.
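The 'dot aware' part of the rule is the fiddly bit, so here's a stdlib-only sketch of the idea (this is not the actual boost::tokenizer skipper, just an illustration of when a '.' should stay inside a token): a dot is kept inside a number only when digits sit on both sides of it, otherwise it comes out as its own token.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Sketch: '.' belongs to a token only inside a float literal (43.4);
// in field access (rage.omg) it is emitted as a separate token.
std::vector<std::string> lexDots(const std::string& src) {
    std::vector<std::string> out;
    size_t i = 0;
    while (i < src.size()) {
        char c = src[i];
        if (std::isspace((unsigned char)c)) { ++i; continue; }
        if (std::isdigit((unsigned char)c)) {
            size_t j = i;
            while (j < src.size() && std::isdigit((unsigned char)src[j])) ++j;
            // keep the dot only if another digit follows it
            if (j + 1 < src.size() && src[j] == '.' &&
                std::isdigit((unsigned char)src[j + 1])) {
                ++j;
                while (j < src.size() && std::isdigit((unsigned char)src[j])) ++j;
            }
            out.push_back(src.substr(i, j - i));
            i = j;
        } else if (std::isalpha((unsigned char)c) || c == '_') {
            size_t j = i;
            while (j < src.size() &&
                   (std::isalnum((unsigned char)src[j]) || src[j] == '_')) ++j;
            out.push_back(src.substr(i, j - i));
            i = j;
        } else {
            out.push_back(std::string(1, c));   // punctuation, one char each
            ++i;
        }
    }
    return out;
}
```

So `lexDots("bar = 43.4 rage.omg")` yields `43.4` as one token but splits `rage`, `.`, `omg` into three.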

Given the following Lua:

foo = 42; bar = 43.4; rage = {} rage:fu() rage.omg = "wtf?"

The following token stream is pushed out:

<foo (28)> <= (18)> <42 (30)>
<bar (28)> <= (18)> <43.4 (29)>
<rage (28)> <= (18)> <{ (20)> <} (21)>
<rage (28)> <: (26)> <fu (28)> <( (24)> <) (25)>
<rage (28)> <. (27)> <omg (28)> <= (18)> <"wtf?" (31)>

Where the number in parentheses is the token id found.

There is a slight issue right now; for example, given this code:

foo = 42; bar <= 43.4; rage = {} rage:fu() rage.omg = "wtf?"

The token stream created is:

<foo (28)> <= (18)> <42 (30)>
<bar (28)> << (26)> <= (18)> <43.4 (29)>
<rage (28)> <= (18)> <{ (20)> <} (21)>
<rage (28)> <: (26)> <fu (28)> <( (24)> <) (25)>
<rage (28)> <. (27)> <omg (28)> <= (18)> <"wtf?" (31)>

Notice that it creates two tokens for the '<=' sequence; this will probably need to be solved in the skipper as well.
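The usual fix is 'maximal munch' at the punctuation stage: when one of Lua's comparison characters is seen, peek at the next character and consume the two-character operator as a single token. A stdlib-only sketch of that idea (operators only, not the full skipper):

```cpp
#include <string>
#include <vector>

// Sketch: greedily match Lua's two-character operators ('<=', '>=',
// '==', '~=') before falling back to single-character tokens.
std::vector<std::string> splitOps(const std::string& src) {
    std::vector<std::string> out;
    for (size_t i = 0; i < src.size(); ++i) {
        char c = src[i];
        if (c == ' ') continue;
        if ((c == '<' || c == '>' || c == '=' || c == '~') &&
            i + 1 < src.size() && src[i + 1] == '=') {
            out.push_back(std::string(1, c) + '=');
            ++i;                        // consume both characters
        } else {
            out.push_back(std::string(1, c));
        }
    }
    return out;
}
```

With this, `splitOps("< <= = ==")` yields the four tokens `<`, `<=`, `=` and `==` rather than splitting the compounds.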

So, once that is solved the next step will be the AST generation.. fun times...

On HD5870 and Memory

Posted in OpenCL, DX11, AMD · 15 January 2011 · 319 views

While gearing up to work on parser/AST generator as mentioned in my previous entry I decided to watch a couple of Webinars from AMD talking about OpenCL (because while I'm DX11 focused I do like OpenCL as a concept); the first of which was talking about the HD5870 design.

One of the more interesting things to come out of it was some detail on the 'global data store' (GDS), which, while only given an overview, contained an interesting nugget of information which would have been easy to skip over.

While not directly exposed in DXCompute or OpenCL, the GDS does come into play with DXCompute's append buffers (and whatever the OpenCL version of the same construct is), as that is where the data is written, thus allowing the GPU to accelerate the process.

What this means in real terms is that if your compute shader needs to store data which everyone in the dispatch needs to get at for some reason, then you could use these append buffers with only a small hit (25 cycles on the HD5 series), as long as the data will fit into the memory block. Granted, you would still need to place barriers into your shader/OpenCL code to ensure that everyone is done writing, but it might allow for faster data sharing in some situations.

I don't know if NV does anything similar; maybe I'll check that out later..

Right, back to the Webinars...
