Has anyone used CUDA and a GPU to build a multithreaded server?

Started by ricardo_ruiz_lopez
6 comments, last by wodinoneeye 14 years, 1 month ago
Hi mates! I found this on the NVIDIA CUDA resource website: "http://maxime.aega.co.uk/paper/massively-parallel%20game%20servers.pdf" It proposes a multithreaded server that uses a GPU to improve performance. What do you think about that? Is that kind of server already being used in industry? How can you use sockets inside a shader processing unit? BTW, what is the max number of connections on a single computer? Perhaps 65536 (the number of ports)? Thanks a lot.
I've seen things you people wouldn't believe. Attack ships on fire off the shoulder of Orion. I watched C-beams glitter in the dark near the Tannhauser gate. All those moments will be lost in time, like tears in rain. Time to die.
Quote:Original post by ricardo_ruiz_lopez
Is that kind of server already being used in industry?

Not that I know of. Game servers are workloads that require a lot of branching and non-data-parallel code, which is a poor fit for a GPU.

Quote:"http://maxime.aega.co.uk/paper/massively-parallel%20game%20servers.pdf"

That article doesn't really do anything for networking that IOCP doesn't already do. It is just a way to use GPU threads as connection handlers.

Quote:How can you use sockets inside a shader processing unit?

You can't. About the only things CUDA gives you are basic arithmetic operations, plus some minimal support for conditional execution. This limits the usability considerably, especially since scaling traffic is not a matter of brute force, but of smart prioritization and throttling on a per-client basis. For example, even if your server could now handle 8000 players in the same room, updating them would mean sending several megabytes of data to each of them, several times per second.
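Back-of-envelope, with numbers I am just assuming for illustration (~50-byte state updates, 10 ticks per second): each of 8000 players must hear about the other 7999, so one tick is 8000 × 7999 × 50 B ≈ 3.2 GB, or roughly 32 GB/s of outbound traffic. No network card comes close, no matter how fast the GPU computed who sees whom.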

In other words, as far as sockets and socket handling go, the GPU is completely irrelevant. If, however, you wish to perform some heavy data-parallel calculations whose results are periodically sent over the network, then you can. But the GPU and networking don't have anything in common.

Quote:BTW, what is the max number of connections on a single computer? Perhaps 65536 (the number of ports)?

Depends on the kernel, OS version, number of IPs, etc.

I know that some 6-7 years ago a single eDonkey server (one physical machine, a PC) was handling on the order of ~400,000 connections.
That document runs only one test, which seems to be "for every player, check which players they can see." That by itself is a pretty bad idea, with O(n^2) complexity. You'd usually use a spatial system that lets you keep track of whom to send to with very few checks that return false (something like the grid sketch below).
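For illustration, a minimal sketch of such a spatial system: a uniform grid, all names hypothetical, and the cell size must be at least the view radius for the 3x3 lookup to be correct. Instead of testing every player against every other, each player gathers only the 9 cells around it:

// Uniform-grid interest management sketch (hypothetical names).
#include <cmath>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Player { int id; float x, y; };

struct Grid {
    float cellSize;  // must be >= view radius
    std::unordered_map<uint64_t, std::vector<Player*>> cells;

    uint64_t key(int32_t cx, int32_t cy) const {
        return ((uint64_t)(uint32_t)cx << 32) | (uint32_t)cy;
    }
    int32_t cellOf(float v) const { return (int32_t)std::floor(v / cellSize); }

    void insert(Player* p) { cells[key(cellOf(p->x), cellOf(p->y))].push_back(p); }

    // Everyone in the 9 cells around (x, y): a small superset of who can
    // actually see the point, instead of the entire player list.
    std::vector<Player*> nearby(float x, float y) const {
        std::vector<Player*> out;
        int32_t cx = cellOf(x), cy = cellOf(y);
        for (int dx = -1; dx <= 1; ++dx)
            for (int dy = -1; dy <= 1; ++dy) {
                auto it = cells.find(key(cx + dx, cy + dy));
                if (it != cells.end())
                    out.insert(out.end(), it->second.begin(), it->second.end());
            }
        return out;
    }
};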

Also, once you've done that test, then what? If you plan on putting almost all of the user-updating logic on the GPU, and it can run in parallel, then sure, that'd probably work. But MMOs are not about complex, heavy, independent algorithms; they are about huge numbers of relatively simple conditionals over a large world.

In the workings of most modern MMOs, I don't see GPUs being much help at all. It'd be cheaper and easier to just buy more servers or better CPUs. An MMO server is probably one of the least ideal places where GPU computing could be of benefit. And this article proves nothing except that a GPU can run an O(n^2) loop in parallel faster than a CPU for a large n.
NetGore - Open source multiplayer RPG engine
Quote:Original post by Spodi
That document runs only one test, which seems to be "for every player, check which players they can see." That by itself is a pretty bad idea, with O(n^2) complexity. You'd usually use a spatial system that lets you keep track of whom to send to with very few checks that return false.


">Well... . Brute force does help if you have 30 million particles.

The problem with that article is that it doesn't consider that network input is serial, so the ability to handle 2 million players in parallel doesn't really help: their inputs will arrive one after another, or you wait for x ms to batch them, losing the lowered-latency benefit.

And for any non-trivial pathfinding, or meshes that would warrant such brute force, CUDA isn't all that great, unless some new algorithms have appeared that I'm not familiar with.
Huh? What does the number of ports have to do with anything? Just because a building's address is 123 Main Street doesn't mean only one person can be in the house, right?

A TCP connection is identified by the four-tuple (local IP, local port, remote IP, remote port). Thus, the maximum number of connections from a given client host to a given server IP/port is the number of client IP interfaces times the number of ports. You can have a million connections going into a single port on a single box, as long as each of the clients has a unique IP/port pair and you have enough RAM in your box.
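A minimal POSIX sketch of that point (error handling omitted; port 5000 is arbitrary): one listening socket on a single local port keeps accepting connections, and each accepted connection is distinguished by the remote side of the four-tuple, never by the local port.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>

int main() {
    int ls = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = INADDR_ANY;
    addr.sin_port = htons(5000);          // one local port for everyone
    bind(ls, (sockaddr*)&addr, sizeof(addr));
    listen(ls, SOMAXCONN);

    for (;;) {
        sockaddr_in peer{}; socklen_t len = sizeof(peer);
        int c = accept(ls, (sockaddr*)&peer, &len);
        if (c < 0) break;
        char ip[INET_ADDRSTRLEN];
        inet_ntop(AF_INET, &peer.sin_addr, ip, sizeof(ip));
        // The four-tuple (local ip, 5000, remote ip, remote port) is unique.
        printf("connection from %s:%d\n", ip, ntohs(peer.sin_port));
        close(c);   // a real server would hand this off to IOCP/epoll instead
    }
}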


enum Bool { True, False, FileNotFound };
I have not used CUDA, but I am currently working on a game server and attempting to write it in a massively parallel way. It's really just part of what could be an MMO server, but it's more of a toy project meant to experiment with a few ideas I have.

The part I'm working on is a system where the client connects to the server, and the server then keeps track of what each client needs to be aware of (other clients, at the moment). As the clients move, the server updates their positions and sends out notifications to the appropriate clients. The idea is to make most of the other logic distributed. Of course, this comes with another set of problems regarding cheating, synchronization, etc., but since it's an experiment I can ignore such issues.

I have an algorithm designed and partially implemented on the CPU using a thread pool, and I intend to implement an OpenCL version and compare the performance of the two. I believe there could be speedups from performing the work on the GPU. When doing so, as with any code designed for a massively parallel environment, it is important to structure the algorithm so that no locking is needed. My current algorithm is made up of several phases with barriers between them. That way no locking is required during a phase (there is no case where two threads try to write the same data). I am hoping this will let the algorithm scale well (either on the GPU or on multiple CPUs). A sketch of the pattern follows.
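To make the phases-with-barriers idea concrete, here is a minimal CUDA sketch of the same pattern (the post above plans OpenCL, but the structure is identical; all names are made up). Each thread writes only its own output slot, so no locks are needed, and the grid-wide barrier between phases falls out of the kernel boundary:

struct Entity { float x, y, vx, vy; };

// Phase 1: integrate movement. Thread i writes only entity i.
__global__ void movePhase(Entity* e, int n, float dt) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    e[i].x += e[i].vx * dt;
    e[i].y += e[i].vy * dt;
}

// Phase 2: after ALL positions are final, thread i may read any entity
// but writes only its own row of the awareness matrix, so still no locks.
// (Naive all-pairs inner loop, for brevity.)
__global__ void awarenessPhase(const Entity* e, int n,
                               unsigned char* aware, float radius2) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    for (int j = 0; j < n; ++j) {
        float dx = e[i].x - e[j].x, dy = e[i].y - e[j].y;
        aware[i * n + j] = (i != j && dx * dx + dy * dy < radius2);
    }
}

// Host side: the kernel launch boundary IS the barrier between phases.
// movePhase<<<blocks, threads>>>(d_e, n, dt);
// awarenessPhase<<<blocks, threads>>>(d_e, n, d_aware, r * r);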

I do have a few issues with the article. As has already been pointed out, the O(n^2) algorithm is a bad idea. To be a fair comparison, it needs to be a real-world system implemented with optimal algorithms on both platforms. It is also a bit misleading when it talks about parallel threads having no negative performance impact. I may be mistaken, but I believe the memory model used by OpenCL/CUDA allows memory to be shared by multiple threads or even multiple cores of the GPU (although another thread may not "see" a change instantly). If a GPU algorithm requires synchronization between threads, that costs performance. GPU algorithms typically have to be written quite differently from algorithms designed for a single thread in order to get the benefits.

I do think there is a lot of potential in GPGPU, but it isn't as magical as some people claim.
The argument is pretty simple:

1. Some servers use algorithm X.
2. Algorithm X can be accelerated using GPGPU methods.
3. Ergo, some servers can be accelerated using GPGPU methods.

The main problem is that finding an "X" for which servers really are compute-bound is kind of hard. The low-hanging fruit might be kinetic games with server-side physics simulation. I don't think the general rigid-body-with-collision problem has been GPGPU-ed in any usable form yet, though. Note that PhysX, last I checked, only accelerates the cloth/particle/fluid physics, not the meat-and-potatoes rigid bodies.
enum Bool { True, False, FileNotFound };


You could probably do O(N^2)-type operations like collision checks (or even box checks) between objects to map out the required interaction subsets (see the sketch below).
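As a hedged illustration of what that mass-culling pass might look like in CUDA (hypothetical names; one thread per object, brute force over all pairs; the CPU then consumes the flags for the detailed, irregular processing):

struct AABB { float minX, minY, minZ, maxX, maxY, maxZ; };

// Brute-force broad phase: mark which object pairs' boxes overlap.
// Thread i writes only row i of the overlap matrix.
__global__ void broadPhase(const AABB* b, int n, unsigned char* overlap) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    for (int j = 0; j < n; ++j) {
        bool hit = i != j &&
                   b[i].minX <= b[j].maxX && b[i].maxX >= b[j].minX &&
                   b[i].minY <= b[j].maxY && b[i].maxY >= b[j].minY &&
                   b[i].minZ <= b[j].maxZ && b[i].maxZ >= b[j].minZ;
        overlap[i * n + j] = hit;  // CPU cores do the detailed tests on these
    }
}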

This would be only a small part of the game's processing, but it would do the basic mass culling (and classify the LOD of interactions) to determine which objects get the more detailed/complex (irregular) processing done by the conventional CPU cores.

Games with mass projectile movement would increase the N counts and could do stepped segment collision tests against other game objects (line intersects triangle...). I suppose 3D collision against a map mesh could likewise be done (as object-move validation on the server side), as long as the terrain didn't have too many triangles (probably only small maps, or simple large ones, will fit).

The problem with all GPUs so far is that they HAVE TO do things at a certain granularity of parallelism (like 8 sets of data streams through the same set of instructions, with fairly limited IF-THEN operations). That means only fairly simple parallel operations will get the high performance gains. The ones above are of that nature, but depending on the game they may be only a tiny portion of the server's processing.
-------------------------------------------- Ratings are Opinion, not Fact

This topic is closed to new replies.
