Communicating changes across a system scalably?
Crossbones+ - Reputation: 3624
Posted 03 April 2012 - 06:22 PM
How do you communicate database changes to thousands of clients while making each client aware of only the data that impacts it, when there is no direct association between a client and that data?
My day job involves working on a small part of a large system. Some of the larger parts of the large system are running into problems with their design (we're on design #3 now) so my coworkers are looking for a broader audience. I don't have tons of experience in the area or a ton of time to scour case studies. I know some folks here work on large distributed systems so here we are.
The essential problem is propagating changes across the system. We have a database for our working data, a set of machines to provide service access to that data, a set of machines to perform long running jobs off of that data, a web server to change the data, and need (eventually) a few thousand clients per deployment.
The database contains a bunch (~500) of file references, and a bunch of metadata about the files. It also contains info about the clients. The system uses the file metadata and the client metadata to determine which files need to be on which clients. There is also time-specific information generated from the file and client metadata (right now, a nightly process generates the data for the next day).
The problem is how to handle changes to the system. If someone adds a new file, adds a new client, or changes any of the metadata in a way that invalidates the generated data/file-client matching... the 'current state' for the clients needs to be regenerated. And business wants those changes propagated to the clients 'as fast as possible'. We've gotten them to be okay with 15-30 minutes.
The bright side (if one could call it that) is that updates to the system should be infrequent. We will have at most 5-10 users writing to the system at once. 0-1 is more likely. The clients... they exist outside of our control even though we write the software for them. They are not guaranteed to be reliably connected to the internet, and dial-up is a possibility still. Plus, since they're out of our control, we aren't keen on putting tons of our data on them.
What we've tried
First idea was to have a NoSQL sort of database with the current snapshot of things. Then the 'current state' for each client was re-generated off of that as quickly as possible. This failed pretty hard since there's horrible transaction support in what was chosen, and no good way to snapshot what amounts to the entire DB.
The second idea was to have all of the mutation in the DB update a set of 'dirty' clients. A workflow would go through and only generate the current state for ones that needed updating. This failed pretty hard since even with the decreased workload, the queries hit a 10 minute timeout once more than a dozen or so clients needed processing.
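For reference, the dirty-flag pattern described above can be kept cheap if the mutation path only sets flags and a separate worker drains them in small batches, so no single pass can run into a query timeout. This is a minimal sketch with made-up names (`mark_dirty`, `regenerate_state`); it is not the actual workflow from the system.

```python
# Hypothetical sketch of the dirty-client pattern: mutations only flag
# clients, and a worker drains the flags in bounded batches so each
# pass stays well under any query timeout.

dirty = set()       # client ids needing regeneration
processed = []      # stand-in for "current state was rebuilt"

def mark_dirty(client_ids):
    """Called by any mutation; cheap, no regeneration happens here."""
    dirty.update(client_ids)

def regenerate_state(cid):
    """Illustrative stub for the expensive per-client rebuild."""
    processed.append(cid)

def drain(batch_size=50):
    """One worker pass: handle at most batch_size dirty clients."""
    batch = list(dirty)[:batch_size]
    for cid in batch:
        regenerate_state(cid)
        dirty.discard(cid)
    return batch
```

The point of the bounded batch is that the worker makes steady progress even when hundreds of clients are flagged at once, instead of one giant query doing everything.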
- Yes, I know... the whole metadata matching thing is the problem. Reduce the complexity of the data interactions, reduce the DB and/or processing hit. I already lost that argument months ago and there's not time to un-fubar the thing.
- The DB is sitting on the beefiest hardware HP makes; no quick fix there.
- No, we (probably) can't just add more DB machines and split the data. A bit of the time-sensitive data generation requires the full view of the data to make its decisions.
- While it sucks, there is good business reason for the prompt client updates, so we can't just regenerate it all overnight.
- The 'current state' at the client needs to be correct. We've already moved some of the processing off to them, so missing a message/event means that the processing goes wrong, which in turn leads to worse errors down the line.
How you can help
At this point I'm just looking for links and ideas. If you've done something similar, what did you do? How did it go? Bad ideas are good too; saves us time finding that out ourselves.
Thanks in advance. I can clarify as needed (where I can).
Moderators - Reputation: 10480
Posted 03 April 2012 - 11:14 PM
Even still... 500 files and 10,000 clients is only 5 million pairings. If it takes longer than a few seconds to chug through that it sounds like you have some serious algorithmic complexity issues, or maybe a really pathological bottleneck (hitting disk repeatedly for data that could be cached in memory, etc.). I'm hard pressed to imagine what's so complicated that it takes 15-30 minutes to process on such a relatively small scale. Maybe you could give some hints as to why the processing is so slow?
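To put a rough number on that claim: scanning all 5 million pairings in pure Python with a trivial predicate takes on the order of a second or two, so anything much slower points at per-pair work or database round-trips. The predicate below is a dummy stand-in, not the real matching logic.

```python
import time

# Back-of-envelope check of the 500 files x 10,000 clients figure:
# brute-forcing every pairing in memory with a trivial predicate.
files = range(500)
clients = range(10_000)

start = time.perf_counter()
# Dummy predicate: parity match. Counts 2.5M of the 5M pairs.
pairs = sum(1 for f in files for c in clients if (f + c) % 2 == 0)
elapsed = time.perf_counter() - start

print(pairs, round(elapsed, 2))
```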
[Work - ArenaNet] [Epoch Language] [Scribblings] [Journal - peek into my shattered mind]
Crossbones+ - Reputation: 3624
Posted 04 April 2012 - 06:41 AM
Another problem point is that the metadata isn't terribly structured. It's arbitrary key/value pairs (except when it's key/list-of-values pairs). When there are multiple values for a key, the join behavior is based on one side being a superset of the other. Not too bad, but not great either.
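If I'm reading the superset rule right, the join looks something like the sketch below: compare values on shared keys, treating a scalar as a one-element set, and call it a match when either side contains the other. The names and shapes here are illustrative guesses, not the actual schema.

```python
# Sketch of a superset-style join on loosely structured metadata.
# file_meta / client_meta: dict of key -> value or set of values.

def as_set(v):
    """Normalize a scalar or a set to a set for comparison."""
    return v if isinstance(v, (set, frozenset)) else {v}

def key_matches(file_vals, client_vals):
    f, c = as_set(file_vals), as_set(client_vals)
    return f <= c or c <= f  # one side must be a superset of the other

def file_matches_client(file_meta, client_meta):
    """Match when every shared key passes the superset test."""
    shared = file_meta.keys() & client_meta.keys()
    return all(key_matches(file_meta[k], client_meta[k]) for k in shared)
```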
The time-specific generated data is much larger, but honestly we haven't even gotten to that yet due to these other issues. That data is essentially a schedule for each client: for each 10-minute block, for each client, 24/7/365, we use the metadata to determine which files are valid for the block. Across all the blocks in the system, files are placed into blocks based on some statistical frequency. Sadly, system changes will need to adjust these too.
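As a rough sketch of the shape of that schedule generation (144 ten-minute blocks per day, files placed by weighted frequency): the weights, validity predicate, and function names below are hypothetical stand-ins for whatever the real metadata drives.

```python
import random

BLOCKS_PER_DAY = 24 * 6  # 144 ten-minute blocks

def build_schedule(file_weights, is_valid, rng=None):
    """One client's daily schedule.

    file_weights: {file_id: weight} (stand-in for statistical frequency)
    is_valid(file_id, block) -> bool (stand-in for the metadata check)
    """
    rng = rng or random.Random(0)  # seeded for reproducibility
    total = sum(file_weights.values())
    schedule = {b: [] for b in range(BLOCKS_PER_DAY)}
    for block in range(BLOCKS_PER_DAY):
        for fid, w in file_weights.items():
            # Place the file with probability proportional to its weight.
            if is_valid(fid, block) and rng.random() < w / total:
                schedule[block].append(fid)
    return schedule
```

Even this toy version makes the cost visible: it's blocks x files per client, per day, which is why metadata changes that invalidate everything hurt so much.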
I am in charge of making the client software, so I don't have a ton of insight into the problems at the backend of the system. As far as I can gather, it is absurd algorithmic complexity due to absurd business desires.
Members - Reputation: 1007
Posted 06 April 2012 - 04:47 PM
As for synchronization of client datasets, some sort of customized hash solution for progressive syncing, combined with a push scheme, might work. Maybe you can have the client run in the background all the time, pinging the server and syncing, though that might be expensive on bandwidth for users who are unaware it's happening. If done right they won't ever have to manually sync, kinda like Windows Update.
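The hash-comparison idea can be sketched like this: the client sends a manifest of `{path: digest}`, and the server answers with only the files to fetch or delete, so an unchanged dataset costs one tiny request. Function and field names here are made up for illustration; a real version would also want versioning and retries for the flaky/dial-up clients mentioned above.

```python
import hashlib

def manifest(files):
    """files: {path: bytes}. Returns {path: sha256 hex digest}."""
    return {p: hashlib.sha256(data).hexdigest() for p, data in files.items()}

def delta(server_files, client_manifest):
    """What the client must fetch or delete to match the server."""
    server_manifest = manifest(server_files)
    fetch = [p for p, h in server_manifest.items()
             if client_manifest.get(p) != h]
    delete = [p for p in client_manifest if p not in server_manifest]
    return fetch, delete
```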
Crossbones+ - Reputation: 3624
Posted 06 April 2012 - 05:22 PM