Back to General and Gameplay Programming

Communicating changes across a system scalably?

General and Gameplay Programming Programming

Started by Telastyn April 04, 2012 12:22 AM

5 comments, last by ddn3 12 years ago

Telastyn

3,778

Author

April 04, 2012 12:22 AM

tl;dr

How do you communicate database changes to thousands of clients, while making individual clients only aware of the data that impacts them (except there is no direct association between a client and the data that impacts it)?

Intro

My day job involves working on a small part of a large system. Some of the larger parts of the large system are running into problems with their design (we're on design #3 now) so my coworkers are looking for a broader audience. I don't have tons of experience in the area or a ton of time to scour case studies. I know some folks here work on large distributed systems so here we are.

The Problem

The essential problem is propagating changes across the system. We have a database for our working data, a set of machines to provide service access to that data, a set of machines to perform long running jobs off of that data, a web server to change the data, and need (eventually) a few thousand clients per deployment.

The database contains a bunch (~500) of file references, and a bunch of metadata about the files. It also contains info about the clients. The system uses the file metadata, and the client metadata and determines what files need to be on what clients. Also there is information generated based on the file and client metadata that is time-specific (right now, a nightly process generates the data for the next day).

The problem is how to handle changes to the system. If someone adds a new file, if someone adds a new client, if any of the metadata changes and invalidates the generated data/file-client matching... the 'current state' for the clients needs to be regenerated. And business wants those changes propagated to the clients 'as fast as possible'. We've gotten them to be okay with 15-30 minutes.

Data Flow

The bright side (if one could call it that) is that updates to the system should be infrequent. We will have at most 5-10 users writing to the system at once. 0-1 is more likely. The clients... they exist outside of our control even though we write the software for them. They are not guaranteed to be reliably connected to the internet, and dial-up is a possibility still. Plus, since they're out of our control, we aren't keen on putting tons of our data on them.

What we've tried

First idea was to have a NoSQL sort of database with the current snapshot of things. Then the 'current state' for each client was re-generated off of that as quickly as possible. This failed pretty hard since there's horrible transaction support in what was chosen, and no good way to snapshot what amounts to the entire DB.

The second idea was to have all of the mutation in the DB update a set of 'dirty' clients. A workflow would go through and only generate the current state for ones that needed updating. This failed pretty hard since even with the decreased workload, the queries hit a 10 minute timeout once more than a dozen or so clients needed processing.

Problem points

- Yes, I know... the whole metadata matching thing is the problem. Reduce the complexity of the data interactions, reduce the DB and/or processing hit. I already lost that argument months ago and there's not time to un-fubar the thing.
- The DB is sitting on the beefiest hardware HP makes; no quick fix there.
- No, we (probably) can't just add more DB machines and split the data. A bit of the time-sensitive data generation requires the full view of the data to make its decisions.
- While it sucks, there is good business reason for the prompt client updates, so we can't just regenerate it all overnight.
- The 'current state' at the client needs to be correct. We've already moved some of the processing off to them, so missing a message/event means that the processing goes wrong which in turn leads to worse errors down the line.

How you can help

At this point I'm just looking for links and ideas. If you've done something similar, what did you do? How did it go? Bad ideas are good too; saves us time finding that out ourselves.

Thanks in advance. I can clarify as needed (where I can).

-- Tangent Developer Journal - Delegates - Part 2

ApochPiQ

23,138

April 04, 2012 05:14 AM

Can you classify clients into groups? e.g. this change affects all Foo clients and all Bar clients, but not Baz or Quux clients.

Even still... 500 files and 10,000 clients is only 5 million pairings. If it takes longer than a few seconds to chug through that it sounds like you have some serious algorithmic complexity issues, or maybe a really pathological bottleneck (hitting disk repeatedly for data that could be cached in memory, etc.). I'm hard pressed to imagine what's so complicated that it takes 15-30 minutes to process on such a relatively small scale. Maybe you could give some hints as to why the processing is so slow?

Wielder of the Sacred Wands
[Work - ArenaNet] [Epoch Language] [Scribblings]

Telastyn

3,778

Author

April 04, 2012 12:41 PM

Yes, we could (and do) put the clients into groups, which is sadly part of the problem. The clients exist in a hierarchy, and much of the metadata can be placed at any point of the hierarchy. Basically ruining any nice join behavior the database might be able to do.

Another problem point is that the metadata isn't terribly structured. It's arbitrary key/value pairs (except when it's key/ list of value pairs). When there is multiple values for the key, the join behavior is based off one side being a superset of the other. Not too bad, but not great either.

The time-specific generated data is much larger, but honestly we haven't even gotten to that yet due to these other issues. That data is essentially a schedule for each client. For each 10 minute block, for each client 24/7/365, we use the metadata to determine what files are valid for the block. For all the blocks across the system, files are placed into the blocks based on some statistical frequency. Sadly, system changes will need to adjust these too.

I am in charge of making the client software, so don't have a ton of insight into the problems at the backend of the system. As far as I can gather, it is absurd algorithmic complexity due to absurd business desires.

-- Tangent Developer Journal - Delegates - Part 2

the_edd

2,109

April 04, 2012 06:50 PM

Communicating changes across a system scalably?

This topic is closed to new replies.

Popular Topics

Recommended Tutorials

Communicating changes across a system scalably?

This topic is closed to new replies.

Popular Topics

Recommended Tutorials

Reticulating splines