So, I'm assuming that you already have a load balancer, which spreads incoming user connections across some number of nodes (hosts) that run processes which "deal with connected users" (call these "user processes").
Also, I assume that those same processes can talk to each other, presumably on a port other than the main incoming-user-connections port. (And presumably firewalled off!)
Also, I assume that you manage many user connections in a single process, because that saves on per-process overhead.
So far, there exists:
- incoming connections from users
- going to one of some number of user-serving processes
- a mapping between "user" and "user-serving process"
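That third piece, the user-to-process mapping, is just a lookup table. A minimal in-memory sketch (in production this table would live in shared storage such as Redis or Zookeeper so every process can read it; all names here are illustrative):

```javascript
// Sketch: "user" -> "user-serving process" mapping as a plain Map.
// In a real deployment this lives in a shared store, not process memory.
const userToProcess = new Map();

function registerUser(userId, processId) {
  userToProcess.set(userId, processId);
}

function lookupUserProcess(userId) {
  // Returns the process id handling this user, or null if unknown.
  return userToProcess.get(userId) ?? null;
}

function unregisterUser(userId) {
  userToProcess.delete(userId);
}
```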
Now, a user wants to create some game instance. I propose that the simplest way to do that is to create a second kind of process, a "game instance process."
I assume that each "game instance process" can manage more than one game instance at the same time -- again, because that's typically how you build Node services.
You would have some function that "selects one game-instance node/process and creates a new game instance on it." It would also register that instance in some database.
You then return that game instance ID to the creating player, and the creating player's user-process would make a connection between that player and the game instance on the game-instance-process server.
Now, when a second user wants to join the same game instance, the user-server-process that manages that user would find the game-instance-process for the game-instance-id, and connect that second user to that process.
In-game chat would go through the game-instances.
If you want to support 'disconnect/reconnect' then you would have another database of "user id" to "game instance currently in," and when a user connects, the user-connection process would look this up, and if it's not empty, immediately (re-)connect the user to the game instance.
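A sketch of that reconnect table, again with illustrative names and an in-memory Map standing in for the database:

```javascript
// "user id" -> "game instance currently in", consulted on every connect.
const userToGameInstance = new Map();

function onJoinGame(userId, instanceId) {
  userToGameInstance.set(userId, instanceId);
}

function onLeaveGame(userId) {
  userToGameInstance.delete(userId);
}

// Called by the user-serving process when a user connects:
// a non-null result means "immediately (re-)connect them to this instance."
function instanceToRejoin(userId) {
  return userToGameInstance.get(userId) ?? null;
}
```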
If you now want to add arbitrary user-to-user chat, then you need a separate database of "user-id" to "user-server-process."
When user B wants to send a message to user C, user B's user-server-instance will look up user C, and send a message to that user-server-instance.
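That routing step can be sketched as below; `sendToProcess` is a placeholder for whatever internal transport connects user-serving processes (an internal TCP or WebSocket connection), and here it just records what would have been sent:

```javascript
// "user-id" -> "user-server-process" directory, plus a fake transport.
const userDirectory = new Map();
const delivered = []; // records messages the placeholder transport "sent"

function sendToProcess(processId, message) {
  delivered.push({ processId, message }); // stand-in for the real transport
}

// User B's user-server-process looks up user C and forwards the message
// to whichever user-server-process currently holds C's connection.
function sendChat(fromUserId, toUserId, text) {
  const targetProcess = userDirectory.get(toUserId);
  if (!targetProcess) return false; // recipient offline or unknown
  sendToProcess(targetProcess, { from: fromUserId, to: toUserId, text });
  return true;
}
```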
The main problem with keeping this data in Redis is that, if a process crashes, Redis doesn't clean up after you.
And if you set an expiry time on this data, then you have to keep refreshing the data while the user is connected. Let's say you expire data after 5 minutes -- this means you have to refresh the data every 4 minutes or so, which adds a not-insignificant additional write load on your Redis instance.
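The arithmetic behind that extra write load, as a sketch (with Redis itself this would be a `SET key value EX ttl` repeated on a timer; the constants here are the 5-minute example from above):

```javascript
// TTL-refresh math for Redis-style presence keys.
const TTL_SECONDS = 300; // expire after 5 minutes
// Refresh with some safety margin before the key would expire.
const REFRESH_SECONDS = Math.floor(TTL_SECONDS * 0.8); // 240s = "every 4 minutes or so"

// Each connected user costs one refresh write per REFRESH_SECONDS,
// so the steady-state extra write rate is:
function extraWritesPerSecond(connectedUsers) {
  return connectedUsers / REFRESH_SECONDS;
}
```

At 100,000 connected users that is over 400 extra writes per second just to keep presence data alive, which is the load the ephemeral-key approach avoids.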
This is why I prefer something like Zookeeper, which can create "ephemeral" keys, which go away if the connection to Zookeeper that created the key goes away.
But, either way can work.
Now, it turns out that, in most systems like these, each "user-server-process" will end up talking to each "game-instance-process", and if you do cross-system chat, each user-server-process will also talk to each other user-server-process, as well as everything talking to the central database. This will scale as N-squared in the number of processes. Luckily, because you can typically handle thousands of users per process, and N=100 processes still keeps N-squared at a reasonable size, you should be able to do 100,000 online players without too much trouble. And if you make sure to optimize the implementation of the various bits, you can probably do 10,000 users per process and N=1000 processes, to support games that are the largest in the world :-)
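The scaling arithmetic above, made concrete (the numbers are the ones from the paragraph; the functions are just illustrations of the N-squared claim):

```javascript
// Full mesh of P processes needs P*(P-1)/2 pairwise connections.
function meshConnections(processes) {
  return (processes * (processes - 1)) / 2;
}

// How many processes a given online population requires.
function processesNeeded(onlineUsers, usersPerProcess) {
  return Math.ceil(onlineUsers / usersPerProcess);
}
```

So 100,000 users at 1,000 users per process is N=100 processes and about 5,000 internal connections, which is entirely manageable; at N=1000 the mesh grows to roughly half a million connections, which is where the implementation needs to be tight.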
Node has the drawback that you can only run a single thread per process. This means that you'll need to run multiple processes (and thus multiple server-instances) per physical host, to make the best use of available cores. This is generally accomplished by mapping each server-instance to a separate port. Thus, the look-up table to find a particular server-instance needs to return both a host (internal IP) and a port number. Similarly, the load balancer for incoming user connections will be configured with multiple back-end processes to load balance to, re-writing the publicly exposed port to whatever the internal port number is for each of the instances ("reverse" or "destination" NAT if your LB is a router; just an internal TCP stream if your LB is something like HAProxy.)
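Concretely, that means the process registry entries carry both pieces of the address. A minimal sketch, with the Map again standing in for shared storage:

```javascript
// Because each Node process listens on its own internal port,
// the registry must map a process id to both host AND port.
const processRegistry = new Map();

function registerProcess(processId, internalIp, port) {
  processRegistry.set(processId, { host: internalIp, port });
}

function addressOf(processId) {
  return processRegistry.get(processId) ?? null; // { host, port } or null
}
```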
One of the best features of Erlang/OTP is that almost all of the features I talk about above (except for the load balancer) are built-into the software already!
You make sure to configure the different Erlang nodes appropriately with their roles, and find each target server using the built-in Erlang process discovery/registry functions, and you'll do great!
With Node, you have to build a bunch of this yourself (as you already discovered).