Hodgman

Posted 05 March 2013 - 02:39 AM


Sorry, I meant that everyone's GPU code is JIT compiled.
You control the GPU via an API like D3D, and shader code (e.g. HLSL) which is compiled into an intermediate bytecode format.

Of course. That's still an AOT compiler, not a JIT, as it compiles the shader before it runs for the first time. I was referring to the merits of a true profiling JIT, one that performs optimizations based on the running profile of the code. I thought that somehow you'd done that with shaders.


The actual code that's run on the GPU depends on both the shader bytecode and the command stream generated from the API.
e.g. when compiling shader bytecode, the driver might have a choice to optimize for reduced register usage (size), or reduced ALU operations (speed). If the compiler guesses that this shader will be memory-bound because it contains many fetch instructions, then it may choose to optimize for size over speed, because this will allow more threads to be in flight at once (the GPU's version of hyperthreading is flexible in the number of "hardware threads" that can be run, depending on the register usage of the current shader).
However, whether this guess is correct depends on the command stream. The API can configure the inputs such that all of the memory fetches read the same byte of memory, resulting in an epic caching win and no memory bottleneck, or it can configure the inputs such that every memory access is random and uncacheable. In the former case the compiler should have optimized for speed, and in the latter for size.
There are many other examples like this -- e.g. when two different shader programs are executing simultaneously on different subsets of the GPU's resources, the best optimization for a shader may depend on what the "neighbouring" shader is.
Apart from these kinds of JIT-to-optimize cases, there are also times when the driver is outright required to modify/recompile the code -- e.g. depending on what kind of texture you've bound via the API, the instructions to fetch and decode floating point values from it will differ (e.g. for byte-textures you need a "divide by 255"); some GPUs have specialized hardware for this that is configured by the API, while others require the shader assembly to be recompiled with the appropriate instructions based on the API state.
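To make that last case concrete, here's a rough sketch of the kind of format-dependent patching a driver might do when recompiling. It's purely illustrative -- the instruction names, and the assumption that this GPU's sampler can't decode unorm bytes itself, are invented for the example:

```cpp
#include <string>
#include <vector>

// Hypothetical texel formats and fetch description -- not any real driver's types.
enum class TexelFormat { Float32, UNorm8 };
struct FetchInstr { int textureSlot; };

// Emit the assembly for a texture fetch, specialised for the format that the
// API has currently bound to this slot. For byte (unorm) textures we have to
// convert 0..255 integers into 0..1 floats in the shader code itself.
void EmitFetch(std::vector<std::string>& asmOut, const FetchInstr& f, TexelFormat bound)
{
    asmOut.push_back("load_raw r0, t" + std::to_string(f.textureSlot));
    if (bound == TexelFormat::UNorm8)
    {
        asmOut.push_back("cvt_f32_u8 r0, r0");
        asmOut.push_back("mul r0, r0, 0.003921569 ; i.e. the divide by 255");
    }
}
```

If the application later binds a float texture to that slot, the same bytecode has to be recompiled without the extra instructions -- which is why this happens in the driver, based on API state, rather than in the offline HLSL compiler.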
 

Is [middleware actually popular] (I'm asking seriously)? It is certainly true on the client side with game engines, but which server-side frameworks are in common use in games?

Game engines usually aren't just a client-side thing.
For a single-player game, the client machine has to run both the game client and the game server.
In a peer-to-peer game, the same is true, as every client often also acts as part of a shared server.
In a client-server game, a popular feature is to allow one player to act as a server, instead of requiring a dedicated server machine.

Given this, it's helpful if you can write the game once, in such a way that it can be used in all of the above models by simply changing some replication and authority policies. High-end game engines contain a lot of utilities to help write these kinds of "simultaneously client-only/server-only/client-is-server" code-bases.
So game engines are both client middleware and server middleware (and often you'll supplement an engine with a lot of extra middleware -- the big engines often provide integration/glue code for other popular middleware packages to make it easy to use them alongside their engine).
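As a purely illustrative sketch (not any particular engine's API -- all the names here are made up), "changing some replication and authority policies" might boil down to something like this:

```cpp
#include <cstdint>

// One code-base, four deployment models, selected by a policy object.
enum class NetRole { SinglePlayer, ClientOnly, DedicatedServer, ListenServer /*client-is-server*/ };

struct ReplicationPolicy
{
    bool runAuthoritativeSim;  // do we simulate authoritative game state here?
    bool sendSnapshots;        // do we replicate state out to remote clients?
    bool predictRemoteState;   // do we predict/interpolate state we don't own?
};

ReplicationPolicy MakePolicy(NetRole role)
{
    switch (role)
    {
    case NetRole::SinglePlayer:    return { true,  false, false };
    case NetRole::DedicatedServer: return { true,  true,  false };
    case NetRole::ListenServer:    return { true,  true,  false };
    case NetRole::ClientOnly:      return { false, false, true  };
    }
    return {};
}
```

The rest of the game code just queries the policy instead of hard-coding "am I the server?" checks everywhere, which is what lets the same code run in all of the above models.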
 
Often as well as the game client and server, there is a "master server", which players connect to first in order to find a particular server.
e.g. a game server hosts 64 players at once and runs a game simulation, while the master server hosts millions of players at once, but only runs chat and matchmaking logic.
In this space, you'll definitely find a lot of standard technologies like MySQL, memcached, JSON, REST APIs over HTTP, Comet/push web stuff, MongoDB, etc...
 
MMOs are similar to the latter group, but I've never worked for an MMO developer so I can't comment on their tech.

In the MMO space, I know of quite a few that are largely written in Python (e.g. EVE, any BigWorld-based ones), and this has come back to bite them in the age of concurrency ("global interpreter lock"...). EVE got blasted recently for being mainly single-core, and part of their official response said (paraphrasing) that multi-core isn't just something they can switch on, and that Python makes the transition especially hard. So those stories do fit with your anonymous MMO anecdote about finding it hard to move away from the single-threaded game-loop model ;)
MMOs have largely been confined to the PC realm though, so unlike console developers, they weren't forced to make the multi-core transition back when multi-core consoles became mainstream.
 

BTW, what actor implementation have you used?

A proprietary (read: NIH) one, because the ones we looked at weren't suitable for games ;)
It worked like this (a rough sketch of the cycle loop follows after this list):
* It operated in 'cycles' in which a group of messages was executed, which could result in objects being created, destroyed, or sent further messages. Each sim frame, the actor model would run these cycles continuously until no more messages were produced, at which point it would begin the next sim frame.
* Every thread had a thread-local buffer where messages could be written. Whenever you called a method on an actor, the args would be written to the current thread's buffer and a future returned.
* Futures didn't have any kind of complex waiting logic -- the system knew immediately how many cycles in the future the value would be ready. The only way to 'resolve' a future into a value was to use it as an argument to a message. That message would be delayed the appropriate number of cycles so that it wasn't executed until after its arguments existed.
* At the end of each cycle:
*** A new buffer full of messages has been produced by each thread. These were merged into one list, and sorted by actor-ID (and a 2nd sorting key to provide determinism via message priority). That sorted queue was then split into one non-overlapping range of actors per thread, to be executed next cycle.
*** Any objects whose reference count had been decremented would be checked for garbage collection / destruction.
*** Any new objects would have their IDs assigned in a deterministic manner.
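
For concreteness, here's a heavily simplified, single-threaded sketch of that cycle loop. All of the names are made up, and the interesting parts (per-thread buffers, splitting the sorted queue into per-thread actor ranges, wait-free ref-counting, deterministic ID assignment) are reduced to comments:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct Message { uint32_t actorId; uint32_t priority; /* payload... */ };

struct ActorSystem
{
    std::vector<Message> pending;   // messages produced during the current cycle

    void Deliver(const Message& m)
    {
        // Executing a message may push new messages into 'pending'
        // (object creation, destruction, further method calls, ...).
    }

    void RunFrame()
    {
        // Run cycles until no more messages are produced.
        while (!pending.empty())
        {
            // Deterministic ordering: sort by actor-ID, then by message priority.
            std::sort(pending.begin(), pending.end(),
                      [](const Message& a, const Message& b) {
                          return a.actorId != b.actorId ? a.actorId < b.actorId
                                                        : a.priority < b.priority;
                      });

            std::vector<Message> current;
            current.swap(pending);  // anything produced from here on belongs to the next cycle

            // In the real system this loop was split into non-overlapping
            // actor-ID ranges, one range per worker thread.
            for (const Message& m : current)
                Deliver(m);

            // ...end-of-cycle bookkeeping: garbage-collect zero-refcount objects,
            // assign IDs to newly created objects deterministically, etc.
        }
    }
};
```

The determinism comes entirely from the sort key and the fixed cycle boundaries -- no matter which threads produced the messages, they get executed in the same order on every run.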
 
It was nice simply because it let people, for the most part, continue writing in a style similar to the traditional OOP that they were used to, but the execution would be split across any number of cores (up to the number of active Actors), while still remaining completely deterministic, and with very little synchronisation between threads (most communication was wait-free in optimal conditions, even reference-counting was wait-free and without cache-contention).
 

I'm not sure this property [determinism] is always so important (or important at all), though.

Determinism can be massively important!
A massive part of game development is spent debugging, and a deterministic sim makes reproducing complex cases very simple -- someone who has produced the error can just save their input-replay file. Anyone can then replay this file to see how the error case developed over time, breaking execution and inspecting state at the error itself, but also in the lead-up to the error.
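
A minimal sketch of what such an input-replay file can look like, assuming a fixed-tick deterministic sim (the struct layout and function names are invented for illustration):

```cpp
#include <cstdint>
#include <fstream>
#include <vector>

// One recorded input sample, stamped with the sim tick it was applied on.
struct InputEvent { uint32_t tick; uint16_t buttons; int16_t stickX, stickY; };

void SaveReplay(const char* path, const std::vector<InputEvent>& events)
{
    std::ofstream f(path, std::ios::binary);
    f.write(reinterpret_cast<const char*>(events.data()),
            static_cast<std::streamsize>(events.size() * sizeof(InputEvent)));
}

std::vector<InputEvent> LoadReplay(const char* path)
{
    std::ifstream f(path, std::ios::binary | std::ios::ate);
    std::vector<InputEvent> events(static_cast<size_t>(f.tellg()) / sizeof(InputEvent));
    f.seekg(0);
    f.read(reinterpret_cast<char*>(events.data()),
           static_cast<std::streamsize>(events.size() * sizeof(InputEvent)));
    return events;
}
```

During replay the game reads the inputs for tick N from this file instead of polling the real devices, and because the sim is deterministic it reproduces the original run exactly.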
 
Many games rely on deterministic simulation in order to even be feasible from a networking point of view. e.g. RTS games are pretty much the only genre where you can actually have several thousand player-controlled entities interacting in a world. The usual methods of state-replication produce ridiculously high bandwidth requirements when applied to these situations, so it's extremely common for multiplayer RTS games to instead use a lock-step simulation where the players only share their inputs.
The bandwidth between keyboard and chair is very low, so networking based on this input means that you can have an unlimited number of units on the battlefield with a fairly constant network bandwidth requirement. Players simply buffer and share their inputs, and then apply them at agreed upon times, resulting in the same simulation on all clients.
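
A rough sketch of that lock-step scheme, assuming a fixed tick rate and a small input delay to hide network latency (the delay value and all names are assumptions for the example):

```cpp
#include <cstdint>
#include <map>
#include <vector>

struct PlayerInput { uint32_t playerId; uint16_t buttons; /* ... */ };

constexpr uint32_t kInputDelayTicks = 2;  // time allowed for inputs to reach all peers

std::map<uint32_t, std::vector<PlayerInput>> inputsByTick;  // tick -> everyone's inputs

void OnLocalInput(uint32_t currentTick, PlayerInput in)
{
    // Schedule the input for an agreed-upon future tick, and broadcast it so
    // every peer schedules it for the same tick.
    uint32_t applyTick = currentTick + kInputDelayTicks;
    inputsByTick[applyTick].push_back(in);
    // SendToAllPeers(applyTick, in);  // hypothetical network call
}

bool TryStepSimulation(uint32_t tick, size_t expectedPlayerCount)
{
    auto it = inputsByTick.find(tick);
    if (it == inputsByTick.end() || it->second.size() < expectedPlayerCount)
        return false;            // stall: someone's input hasn't arrived yet
    // ApplyInputs(it->second);  // deterministic sim step using everyone's inputs
    inputsByTick.erase(it);
    return true;
}
```

Note that only the small PlayerInput structs ever cross the wire, so bandwidth scales with the number of players, not the number of units being simulated.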

Also, many console games don't have the luxury of dedicated servers. Instead, they often use the one-client-is-the-server model. This means that the server's outward bandwidth is restricted to whatever the upload bandwidth is on a residential DSL connection (which is pretty low). This makes networking strategies based around determinism and input sharing quite popular, due to the low bandwidth requirements.

For a general-purpose game engine that can be used for any genre, it's pretty much required to support determinism in case the end user (game developer) requires it. To not support it is to cut off a percentage of your clients.

[edit: missed this one]

Which frameworks are used for [stream processing]?

I'm not aware of any popular stream processing middleware; the ones I've been exposed to have either been a component of a game engine, or a home-grown proprietary solution. I guess one of the nice things about stream processing is that you don't need a big, complicated framework to implement it, so the damage of NIH isn't as bad.


On a tangent, the difference between (pejorative) "frameworks" and (simple) "libraries" is an interesting one. Technically, they're often both just code libraries, but a "framework" usually requires that all the code you write when using it conforms to its particular worldview. e.g. every class has to follow some pattern, inherit some base, provide some overload, be mutated in some certain way, etc., etc... You can usually tell when you try to use two different "frameworks" together and end up in a hell of a conflicting mess. Game engines often fit this category -- they take on so many responsibilities that everything you do is tightly coupled to their way of doing things...
In some of the projects that I've worked on, these kinds of frameworks have caused a lot of internal friction, so I've seen many teams resort to attacking code-bases with a figurative axe, breaking "frameworks" down into a large collection of simple, single-responsibility libraries, resulting in a much more flexible code-base.
The reason I bring this up is that the last time I saw this happen, the instigator was a parallel processing framework and the need for some "exceptions" to be bolted onto it. Like you said earlier, even functional languages sometimes resort to using the shared-state model! The end result was that we ended up with a few different processing libraries that could be used together easily, without any frameworks tying code down too strongly to a single model.


On the PS3, there is a popular bit of middleware called SPURS, which basically implements a "job"-type system -- a job is a function and a bunch of parameters/data-pointers, which you can put into a queue and later check/wait for its completion.
Because of this, many game engines use this model as their low level concurrency layer (implementing "job" systems on 360/PC/etc as well). They then often build another layer on top of this job system that allows for stream processing or other models.
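
Just to illustrate the shape of that abstraction (this is a generic sketch, not SPURS's actual interface -- all names are invented):

```cpp
#include <atomic>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>

// A job is just a function plus its data, with a completion flag you can poll/wait on.
struct Job
{
    std::function<void(void*)> fn;
    void* data = nullptr;
    std::atomic<bool> done{false};
};

class JobQueue
{
    std::deque<Job*> queue_;
    std::mutex m_;
    std::condition_variable cv_;
public:
    void Push(Job* j)
    {
        { std::lock_guard<std::mutex> lock(m_); queue_.push_back(j); }
        cv_.notify_one();
    }
    void WorkerLoop()  // run by each worker thread (or SPU, in the PS3 analogy)
    {
        for (;;)
        {
            Job* j = nullptr;
            {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [this] { return !queue_.empty(); });
                j = queue_.front(); queue_.pop_front();
            }
            j->fn(j->data);
            j->done.store(true, std::memory_order_release);
        }
    }
};

inline void WaitForJob(const Job& j)
{
    while (!j.done.load(std::memory_order_acquire)) { /* spin, or help run other jobs */ }
}
```

Higher-level layers -- stream processing, task graphs, parallel-for -- can then be built by generating batches of these jobs and waiting on their completion flags.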

One popular game engine that I've used implemented a pretty neat stream processing system (see the sketch after this list) where:
* the processing pipeline could be described as data - the kernels would be compiled from this description dynamically.
* the pipeline description was a graph of nodes representing function pointers, and the edges told it how the function's input/output arguments flowed between nodes.
* when compiling, you could tell it how much 'scratch memory'/'node memory' etc it had available, which would affect how it unrolled the loops and constructed kernels etc...
** e.g. maybe it would just call each function in sequence on one piece of data at a time, or maybe it would call function #1 on 64 pieces of data and then function #2 on those 64 results, etc...
* on platforms that allowed for code-generation, it would actually memcpy code around the place to create highly efficient kernels. On other platforms it operated by actually calling function pointers.
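
A tiny sketch of that "pipeline described as data" idea -- nodes are function pointers, edges describe how data flows between them. Everything here is invented for illustration, and the interesting part (turning the description into batched or code-generated kernels) is reduced to a naive interpreter:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

using KernelFn = void (*)(const float* in, float* out, size_t count);

struct Node { KernelFn fn; };
struct Edge { size_t fromNode, toNode; };  // output of 'fromNode' feeds 'toNode'

struct PipelineDesc
{
    std::vector<Node> nodes;
    std::vector<Edge> edges;
    size_t scratchBytes;   // per-node scratch memory the "compiler" is allowed to use
};

// Naive interpretation of the description: run each node over the whole batch,
// assuming a simple linear chain of nodes. A real implementation would batch,
// unroll, or stitch the functions together based on scratchBytes -- or, on
// platforms that allow it, memcpy the code into a single generated kernel.
void RunNaive(const PipelineDesc& p, const float* input, float* output, size_t count)
{
    std::vector<float> a(input, input + count), b(count);
    for (const Node& n : p.nodes)
    {
        n.fn(a.data(), b.data(), count);
        a.swap(b);
    }
    std::copy(a.begin(), a.end(), output);
}
```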

You can read a bit about a similar system here: http://www.insomniacgames.com/tech/articles/0907/files/spu_shaders_introduction.pdf
