Distributed transaction & message queue

11 comments, last by hplus0603 9 years, 4 months ago

Recently I've been working on finding a proper solution for distributed transaction, and I found this: http://www.codeofhonor.com/blog/wp-content/uploads/2012/04/Patrick-Wyatt-Writing-reliable-online-game-services.pdf

The PDF's content comes from a lecture Patrick Wyatt gave at GDC 2012, and I believe many of you have already read it.

In this lecture, Patrick showed how he solved the consistency of a trade between two players stored in two different databases (not by using XA); the approach is to achieve 'eventual consistency' by using a 'transaction queue'. I will briefly explain this approach below.

------------------>>>>

1. Let's suppose player1 is in db1 and player2 is in db2, and player1 wants to give $10 to player2.

2. We first commit a transaction to db1, like this


begin_transaction_in_db1
  update player set money = money - 10 where id = money_giver_id;
  insert into transaction_queue values(db2, 'increase_player_money', money_receiver_id, 10);
end_transaction

What we insert into the table transaction_queue is essentially a promise-- we will increase the money_receiver's money in the future.

3. A worker process will scan the table transaction_queue in db1, and when it gets the new record inserted above, it will commit a transaction to db2, like this


begin_transaction_in_db2
  update player set money = money + 10 where id = money_receiver_id;
end_transaction

4. By implementing it like the above, we can achieve eventual consistency.

---------------<<<<

I have found many similar ideas like this, all intended to avoid the high-latency XA support provided by MySQL.

But I'm really wondering: what if transaction 2 fails while transaction 1 succeeds? Apparently we will be unable to roll back transaction 1 when we find that transaction 2 has failed, because by that time transaction 1 has already been committed...



what if transaction2 fails (e.g., the money_receiver no longer exists when transaction2 is executed)?

There are two kinds of failures. One is a communication failure (server crashes, network connection aborted, hard disk crashed, etc.), the other is a data-consistency failure. The first idea is to ensure data consistency (just check it up front), then execute the transaction. If the second (db2) transaction fails, you retry it (for that you need idempotent transactions). If it fails because of some data inconsistency, you need to revert your first transaction. Eventually there should always be some other safety systems in place, like not paying out any real cash until all transactions in the second db have been successfully committed.

There is always a trade-off: you can't have 100% transaction-safe behavior most of the time, but you can have 99.99%. If one of the 0.01% cases fails, you need to handle it another way, most often manually. It is just important that

1. you need a (high-performance) way of handling the transactions for 99.99% of the data.

2. you need a way to get information about "hard" failures.

3. you need a way to (manually) fix these hard failures.

When you use message queuing, you must only use messages that are eventually fulfillable.
"Give X credits to player Y" is always eventually fulfillable, unless you have some business rule that forbids a balance over a limit.
If you do have such a limit, the message would be "Give X credits to player Y, but if the limit is hit, send the remaining credits to location F and send an email to the giver and recipient" or something like that.

However, "give value from A to B" isn't a very interesting use case, because it is inherently always fulfillable.
What's more interesting is two-party things, like "give X credits from A to B while transferring the flaming sword of fire from B to A."
That would have to be formulated something like:
Message A->B: "Here's X credits, give them to B and enqueue a message to give the sword to A, OR enqueue a message to refund the credits."
Message B->A: "Here is the sword." OR "Here are the credits back."
Keep re-formulating the messages and the protocol until the likelihood of failure is very small.
Also, "eventual consistency" matters here. If system B is down for a while, that's OK, as long as it eventually comes back and processes the message.
If you're in an environment where that can't be guaranteed, then the message becomes a little more complex:
Message A->B: "Here's X credits, give them to B and enqueue a message to give the sword to A, OR enqueue a message to refund the credits, OR if timing out after T time, refund the credits."
Message B->A: "Here's a sword. If timing out after time T, you're screwed and need manual intervention."
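The credits-for-sword exchange can be sketched as plain message handlers. This is a toy illustration with names of my own choosing, and dicts standing in for the two systems' state; in a real system each handler would itself be a durable, idempotent, logged transaction.

```python
# Message A->B is an offer; B replies with either the item or a refund.
# A timeout on A's side is treated like a refund, per the protocol above.

def a_send_offer(a, credits):
    a["balance"] -= credits                       # A debits the credits up front
    return {"credits": credits, "want": "sword"}  # Message A->B

def b_handle_offer(b, offer):
    if offer["want"] in b["inventory"]:           # B still holds the sword
        b["inventory"].remove(offer["want"])
        b["balance"] += offer["credits"]
        return {"item": offer["want"]}            # "Here is the sword."
    return {"refund": offer["credits"]}           # "Here are the credits back."

def a_handle_reply(a, offer, reply, timed_out=False):
    if timed_out or "refund" in (reply or {}):    # timeout after T, or explicit refund
        a["balance"] += offer["credits"]
    else:
        a["inventory"].append(reply["item"])

a = {"balance": 100, "inventory": []}
b = {"balance": 0, "inventory": ["sword"]}
offer = a_send_offer(a, 25)
a_handle_reply(a, offer, b_handle_offer(b, offer))
print(a)  # {'balance': 75, 'inventory': ['sword']}
print(b)  # {'balance': 25, 'inventory': []}
```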

At some point, any system will run into some limitation. Maybe there's a rule that you can only own one elemental sword, and when the sword returns, A already owns the freezing sword of frost. Maybe you saved space in the database by only allocating 16 bits of credits storage, and B went over the 65,535 credit limit. Maybe player A gets banned and deleted from the system before the sword makes it there.
At some point, you will have to throw an exception (metaphorically -- not necessarily using your language's exception mechanism) and kick the action up to an operator/human. When this happens, it's generally super helpful if the operations in the system leave good audit trails, using some traceable unique ID, and you have good tools to actually see each step that was taken in the chain.
enum Bool { True, False, FileNotFound };

Thanks to @Ashaman73 and @hplus0603. I like the points 'There is always a trade-off' and 'At some point, any system will run into some limitation'; they are helpful to me :)

Escrow-style services can be useful for that kind of transaction if you fear a multi-part transaction can fail.

This becomes important if you really do need audit trails as hplus mentioned.

For example, the {give money from A to B, give sword from B to A} trade can work well with a queue, as the escrow service becomes the queue consumer. Neither A nor B performs the transaction; the work is done through the escrow service.

The transactions then become a series of smaller steps:

1. If you have multiple consumers, first mark the request as started, but do not remove it from the queue, so other workers do not consume it.

2. The service performs the transfers, presumably using a unique ID assigned when the request entered the queue, so each transfer is idempotent. Each sub-step also needs to be atomic for the system to work, so you need an appropriate level of locking or multiversioning, either your own or your database's at that level; there are libraries and databases that do this, so don't reinvent the wheel -- the details of atomic simultaneous write/write and read/write are notoriously tricky. For instance, with a combination { TransactionId: ###, from: A, to: B, object: money }, if the money transfer were repeated it could verify that A already had the withdrawal for that ID and B already had the deposit for that ID; if either was missing it could be repeated, and if present it would not be. That sub-transaction needs to be properly logged in the audit system and be atomic, which can be a neat database trick of either commit or rollback.

3. Then you have the second, similar idempotent transfer { TransactionId: ###, from: B, to: A, object: sword }, which, if the transaction ID has not already been run, removes the sword from one side's inventory (potentially re-validating its presence if we are talking about a more serious system) and adds it to the other side's inventory, again done as a commit-or-rollback atomic step. It is idempotent since you can check whether an item has already been transferred under that specific transaction ID, preventing duplicate removals and duplicate additions.

4. Finally, the queue consumer, acting as the escrow agent, marks the transaction as complete and removes it from the queue.

Now both systems have completed all the transactions and all is well. That handles the 99.99% example above.
If any step fails for that 0.01% (like the sword having already been removed from one side's inventory before the transaction runs, because the locking/multiversioning mechanism somehow failed), you can run a reversal transaction. If that also fails (a 0.01% on top of another 0.01% chain of failures), then you flag it for humans to solve and move on.

Depending on your level of confidence you may even need additional verification that the sub-steps are atomic and idempotent. It all depends on how far down the rabbit hole you want to go.
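The ID-keyed idempotent transfer described above can be sketched like this. This is my own minimal rendering with in-memory objects; a real escrow consumer would do each step as a logged, atomic database transaction.

```python
class Account:
    def __init__(self, items):
        self.items = dict(items)   # e.g. {"money": 100} or {"sword": 1}
        self.applied = set()       # transaction IDs this account has already seen

def transfer(txn_id, src, dst, obj, amount=1):
    # Withdrawal and deposit each check the transaction ID first, so the
    # whole step is safe to repeat after a crash or a duplicate message.
    if txn_id not in src.applied:
        if src.items.get(obj, 0) < amount:
            raise RuntimeError("run a reversal, or flag for manual intervention")
        src.items[obj] -= amount
        src.applied.add(txn_id)
    if txn_id not in dst.applied:
        dst.items[obj] = dst.items.get(obj, 0) + amount
        dst.applied.add(txn_id)

a = Account({"money": 100})
b = Account({"sword": 1})
transfer("txn-42/money", a, b, "money", 10)  # { TransactionId, from:A, to:B, object:money }
transfer("txn-42/sword", b, a, "sword")      # { TransactionId, from:B, to:A, object:sword }
transfer("txn-42/money", a, b, "money", 10)  # duplicate delivery: changes nothing
print(a.items, b.items)  # {'money': 90, 'sword': 1} {'sword': 0, 'money': 10}
```

The deliberate re-run of the money transfer on the last line is the point: because both sides remember the transaction ID, a retry can never double-withdraw or double-deposit.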

An enormous number of games suffer from exploits due to poor handling of these transactions: ways to duplicate items, or to gain the difference between items via an undo-style transfer (e.g. sell, then buy back from the shopkeeper before closing the screen) or a rollback (e.g. start the action that consumes resources, then cancel back out to recover them). It doesn't make much sense to spend much time on this kind of thing for local games, but for online games, where in-game resources can have cash value on marketplaces, it becomes much more significant.

The one problem I have is that this does nothing to prevent a double-spend exploit.

In fact, doing everything in one transaction doesn't do the trick either.

When your player trade starts, the first thing that happens is some sort of logic test to make sure the player should be allowed to perform the trade (i.e., does the player have enough gold?).

Then it actually performs the transaction(s).

If a second trade spins up between these two steps, then the user will be able to spend into the negative, since your logic tests will see old data.

You would have to have some way to prevent other database sessions from reading this data, as well as editing it, so that only one trade happens at a time (maybe just by locking the records?).

Database transactions aren't a silver bullet.
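One common way to close that check-then-act gap is to fold the logic test into the write itself, so the database checks and debits atomically instead of your code reading first and writing later. A sketch, with an illustrative schema:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE player (id INTEGER PRIMARY KEY, money INTEGER)")
db.execute("INSERT INTO player VALUES (1, 15)")
db.commit()

def try_spend(player_id, amount):
    with db:
        cur = db.execute(
            "UPDATE player SET money = money - ?"
            " WHERE id = ? AND money >= ?",   # check and debit in ONE statement
            (amount, player_id, amount))
    # 0 rows updated means the balance check failed: no negative balances,
    # no matter how many sessions race on this row.
    return cur.rowcount == 1

print(try_spend(1, 10))  # True  (15 -> 5)
print(try_spend(1, 10))  # False (5 < 10, so no row matched)
```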

The one problem I have is that this does nothing to prevent a double-spend exploit.

Sorry, but it really is a solved problem.

It was solved centuries ago in the physical world with escrow services, and was solved decades ago in the electronic financial industry with a variety of different algorithms, including the one presented in my post above.

You have two different systems, each with the potential to fail, or not satisfy the requirements at the time of transaction, or be fraudulent, or be cancelled in the future. Transactions like {buy ticket on ticketing system, transfer funds with banking system} happen all the time around the globe. Others like {remove item from inventory tracking system, transfer funds with banking system} happen thousands of times every second at major retail chains.

You start with both systems using databases that satisfy ACID requirements, then you use your own intermediate system with idempotent transactions that handle duplicate events, auditing, and enable implementing product returns or mid-transaction exceptions.

This is very much a solved problem. It requires quite a few small steps, but done correctly the steps can be safely re-run multiple times, safely be interrupted and resumed, safely be cancelled mid-process during an exceptional process like insufficient funds or discovery of fraud, and can safely be cancelled through a second later transaction. It can also generate audit trails adequate for everything including government investigation and insurance investigation.

This is very much a solved problem.

However, it is NOT solved by the example provided.

This is why YOU said:


Depending on your level of confidence you may even need additional verification that the sub-steps are atomic and idempotent. It all depends on how far down the rabbit hole you want to go.
Yes, if you are not confident in your underlying systems, particularly that they satisfy ACID requirements, then you need to verify those as well.

Most businesses are quite willing to trust that their database servers from Postgres or MySQL or Oracle or whatever satisfy those requirements. Others will create test plans that include testing those systems, and they occasionally discover database system bugs.

Yes, if you are not confident in your underlying systems, particularly that they satisfy ACID requirements, then you need to verify those as well.

You are forgetting that ACID databases do not lock reads while you have a transaction open.

If something else reads the value during your transaction it will see the old value prior to your modification since your changes are queued up in a working copy of the modified records until you commit.

If you don't take that into consideration, you will end up with a situation where you can spend some gold while the prior action of spending gold is still in process. The ACID compliant database won't help you there.


This is multiversion concurrency control (MVCC), in which the database provides each reading transaction the prior, unmodified version of data that is being modified by another active transaction. This allows readers to operate without acquiring locks, i.e. writing transactions do not block reading transactions, and readers do not block writers. Going back to the example, when user A's transaction requests data that user B is modifying, the database provides A with the version of that data that existed when user B started his transaction. User A gets a consistent view of the database even if other users are changing data.

Locking the data against read access while you are writing to it in a transaction is not the default behavior of ACID databases. You have to lock it manually.
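As a concrete sketch: SQLite only has database-level write locking rather than row locks, but it can still show the principle. Taking the write lock before reading the balance (BEGIN IMMEDIATE here; the row-level equivalent in databases such as PostgreSQL or MySQL/InnoDB is SELECT ... FOR UPDATE inside the transaction) means no other session can start its own read-modify-write on stale data:

```python
import os
import sqlite3
import tempfile

# Two connections to the same on-disk database simulate two sessions.
path = os.path.join(tempfile.mkdtemp(), "demo.db")
con1 = sqlite3.connect(path, timeout=0.1)
con2 = sqlite3.connect(path, timeout=0.1)

con1.execute("CREATE TABLE player (id INTEGER PRIMARY KEY, money INTEGER)")
con1.execute("INSERT INTO player VALUES (1, 100)")
con1.commit()

# Session 1 takes the write lock BEFORE reading, then reads and updates.
con1.execute("BEGIN IMMEDIATE")
money = con1.execute("SELECT money FROM player WHERE id = 1").fetchone()[0]

# Session 2 cannot begin its own write transaction until session 1
# commits, so it can never act on a stale balance it read meanwhile.
try:
    con2.execute("BEGIN IMMEDIATE")
    blocked = False
except sqlite3.OperationalError:  # "database is locked"
    blocked = True

con1.execute("UPDATE player SET money = ? WHERE id = 1", (money - 10,))
con1.commit()
print(blocked)  # True
print(con1.execute("SELECT money FROM player WHERE id = 1").fetchone()[0])  # 90
```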


This topic is closed to new replies.
