GT.M Logical Multi Site (LMS) and eventual Consistency

Excerpt from NoSQL discussion forum - Bhaskar's notes about replication

From: "K.S. Bhaskar"
Date: Tue, 17 Nov 2009 02:43:41 -0800 (PST)
Topic: GT.M Logical Multi Site (LMS) and eventual Consistency [warning, long]

On another thread, I promised to post a discussion of GT.M and eventual Consistency (a stronger requirement than eventual consistency).

Eventual consistency refers to multiple instances of an application where the database states are not identical at any given instant, but where they are all guaranteed to reach the same state: given a sufficient period of time with no updates, they will all be identical.

There are applications (for example, in financial services) where the above model of eventual consistency does not suffice. In these applications, not only must all instances reach the same state, they must all reflect the same path through state space. I'll use the term eventual Consistency for this. GT.M has functionality that allows the creation of logical multi-site (LMS) application configurations that achieve eventual Consistency. Although the database engine provides the underlying functionality for configurations in which eventual Consistency includes the path through state space, the engine alone does not have sufficient knowledge to restore Consistency; that requires the database engine and application logic to cooperate and share responsibilities.
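To make the distinction concrete: when updates do not commute, agreeing on the path is the only way to agree on the state. Here is a toy M sketch (the global ^X is hypothetical, purely for illustration):

    SET ^X=10 SET ^X=^X+10 SET ^X=^X*2  ; path 1: 10 -> 20 -> 40
    SET ^X=10 SET ^X=^X*2 SET ^X=^X+10  ; path 2: 10 -> 20 -> 30

Both orders quiesce, but at different values. And even when updates do commute (like the two ATM withdrawals in the example further down), eventual Consistency still requires all instances to record the same order.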

[Note: the GT.M database engine ensures ACID (Atomic, Consistent, Isolated, Durable) properties on the node executing business logic for any application code bracketed by TStart / TCommit commands. A transaction is committed by writing to a journal file before writing to the corresponding database file. Although a gross simplification, for the purposes of this high-level write-up, it suffices to consider each update in any application code not bracketed by TStart / TCommit commands as a mini-transaction.]
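For readers who don't use M, a transaction bracketed by TStart / TCommit looks roughly like this sketch (the routine name and the globals ^ACN and ^HIST are made up for illustration; this is not code from any actual application):

    withdraw(acct,amt) ; debit acct by amt as one ACID transaction
     TSTART ():SERIAL
     SET ^ACN(acct,"bal")=^ACN(acct,"bal")-amt
     SET ^HIST(acct,$INCREMENT(^HIST(acct)))="WD^"_amt ; append to the account history
     TCOMMIT
     QUIT

Everything between TSTART and TCOMMIT commits atomically; an update outside such a bracket behaves as the mini-transaction described above.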

The basic LMS model has one instance at which all business logic is executed at any given point in time (the originating primary instance). On that instance, in parallel with writing to the journal file to commit a transaction, update information is also written to an area of shared memory called a Journal Pool, from which Source Server processes stream it in real time over TCP connections to as many as sixteen other instances (called replicating secondary instances). The streaming is asynchronous, so in CAP (Consistency, Availability, Partitioning) terminology, GT.M LMS is an AP application. In practice, the streaming is very fast and parsimonious in its use of bandwidth, so under normal operating conditions the update information can be expected to leave the originating primary instance in the sub-millisecond range. In the event a network connection from an originating primary to a replicating secondary instance is interrupted, the secondary catches up automatically when the connection is restored, while the primary continues normal operation.
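Operationally, replication is set up with MUPIP commands along these lines (illustrative only - the exact qualifiers vary with the GT.M version, so check the Administration and Operations Guide; the host name, port and log file paths here are made up). On the originating primary:

    $ mupip replicate -source -start -secondary=backup.example.com:4000 -log=/var/log/gtm/source.log

and on each replicating secondary:

    $ mupip replicate -receiver -start -listenport=4000 -log=/var/log/gtm/receiver.log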

Each secondary instance can stream to sixteen tertiary instances and so on (for a total of 256 tertiary instances, 4096 quaternary instances, etc.). In the event an originating primary instance goes down (in an unplanned or planned event), any downstream instance can in principle become the new originating primary instance. In practice, network considerations usually mean that one of the sixteen replicating secondary instances becomes the new originating primary instance. When the original originating primary instance recovers, it becomes a replicating secondary instance of the new originating primary instance. In other words, all instances can be considered equals, except that at any given point in time, only one instance executes business logic.

Now, let us consider two instances, A and B, where initially A is the originating primary instance (P) and B is a replicating secondary instance (S). Suppose A has just committed transaction 100, while B has just replicated transaction 95 (so, for whatever reason, there is a five-transaction backlog in the replication from A to B).

    A        B        Comments
    P:100    S:95     Initial state
    Fails
             P:95     B becomes the primary
             P:120    B processes 25 transactions as primary

Note that at this point, transactions 96 through 100 processed by A are different from transactions 96 through 100 processed by B. Also, the transactions on A are temporarily Unreplicated Transactions.* When A recovers, it comes up in a secondary role. In order to restore Consistency, as part of re-establishing communication (with B now P and A now S), A performs a rollback of transactions 96 through 100. The rolled back transactions are placed into an Unreplicated Transaction File (UTF). A then resumes normal operation as S to B operating as P (a sketch of this rollback follows the table below):

    A        B        Comments
    S:95     P:120    A's transactions 96 through 100 are now in the UTF
    S:120    P:120    A catches up with transactions processed by B
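For the curious, the rollback that generates the UTF is a MUPIP operation on A roughly like the following (again illustrative - qualifiers vary by GT.M version, the file name is made up, and the qualifier still reflects the historical "lost transaction" terminology*):

    $ mupip journal -rollback -backward -fetchresync=4000 -losttrans=/var/gtm/unreplicated.txt "*"

The -fetchresync qualifier causes the rollback to communicate with the new originating primary to determine the common transaction (95 in this example); everything A committed after that point is rolled back into the -losttrans file.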

We still have not dealt with the UTF. As soon as A performs its rollback, the UTF can be sent to B, the new P, for Unreplicated Transaction Processing (UTP). This is where application logic is required to restore Consistency. Although GT.M provides tools to decide whether an update in the UTF from A can be applied to the database without colliding with another update performed by B, for the types of applications for which LMS configurations are used, it is not usually that simple. In practice, UTP must examine each transaction in the UTF and decide whether it can simply be reapplied or whether other processing is required.
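Schematically, a UTP driver might be structured like this M sketch (the global ^UTF, the record layout and the helper labels are all hypothetical; a real UTF is a journal extract file with a documented format, and real triage logic is application specific):

    utp ; walk the unreplicated transactions and triage each one
     NEW seq,rec,type
     SET seq="" FOR  SET seq=$ORDER(^UTF(seq)) QUIT:seq=""  DO
     . SET rec=^UTF(seq),type=$PIECE(rec,"^",1)
     . IF type="BATCH" QUIT                ; scheduled work the new primary redid itself
     . IF type="WD" DO reapply(rec) QUIT   ; safe to re-post as a new transaction
     . DO review(rec)                      ; anything else needs application-specific handling
     QUIT
    reapply(rec) ; re-post the update as a new transaction on the new primary
     QUIT
    review(rec) ; route to compensation logic or a manual review queue
     QUIT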

So, as a simple example, if the UTF from A contains an ATM withdrawal I made, and B's database contains an ATM withdrawal made by my wife subsequent to mine, UTP may result in the order of the transactions being reversed. As long as our bank account contains enough money for both, the order doesn't really matter - but both A and B must agree on the order (i.e., both must show the same path through the database).

For a more complex example, let's say that I use the web to transfer money from my savings account to my checking account, and this is processed by A. For my next transaction, I initiate an electronic payment from my checking account; but before I do, A crashes with the transfer not yet replicated to B. The payment is processed by B, as a result of which I am assessed an overdraft protection service charge for insufficient funds. UTP on B should recognize the transfer in the UTF and reverse the service charge and overdraft protection. [Note that the service charge and overdraft protection are never electronically erased out of existence - UTP initiates new transactions, so when I get my bank statement, I will see the service charge, followed promptly by a reversal. I may think "those stupid bankers", not realizing that this is a side effect of a very rare combination of circumstances that kept the bank operating, providing eventual Consistency with identical paths through state space.]
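The compensation itself is just an ordinary new forward transaction, e.g. (same hypothetical globals as the earlier sketch):

    TSTART ():SERIAL
    SET ^ACN(acct,"bal")=^ACN(acct,"bal")+fee ; credit back the service charge
    SET ^HIST(acct,$INCREMENT(^HIST(acct)))="FEEREV^"_fee ; the statement shows the charge, then its reversal
    TCOMMIT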

Some of those transactions in the UTF may be routine scheduled "batch" processing (such as posting interest that was previously accrued). When B starts operating as the new primary instance, it simply picks up interest posting based on the state of the database that does not include interest posting already performed by A. UTP can simply ignore these transactions in the UTF. So, in our example, if three of the five transactions in the UTF came from scheduled processing, only two actually require reprocessing, and the system gets to the following state:

    A        B        Comments
    S:122    P:122    UTP completed

If you have rock-solid UTP, here is an even more complex failure scenario. A is an originating primary instance streaming to geographically separated replicating secondary instances B and C. The network fails, isolating each server with the clients in its region. Now A, B and C can all operate as parallel originating primary instances. When connectivity is restored, the rollback / UTP procedure can be used to bring A, B and C all back into Consistency with the same path through state space.

This has turned out to be a longer post than I intended. If you read this far, thank you for your patience! Questions are welcome.

Regards -- Bhaskar

* Unreplicated Transactions have historically been called Lost Transactions. For good reasons (none of them technical), they are in the process of being renamed Unreplicated Transactions.


From: Jason Dusek
Date: Tue, 17 Nov 2009 10:39:36 -0800
Topic: Re: GT.M Logical Multi Site (LMS) and eventual Consistency [warning, long]

Thank you for your message. It's great to see a blow-by-blow description of transaction processing with GT.M.

How are transactions identified? Are they given UUIDs by each GT.M node that receives them? Are transactions ever "too old" such that they need be ignored? Are there scenarios where a transaction must be aborted in the event of network outage or delay?

-- Jason Dusek


From: "K.S. Bhaskar"
Date: Tue, 17 Nov 2009 16:25:11 -0800 (PST)
Topic: Re: GT.M Logical Multi Site (LMS) and eventual Consistency [warning, long]

On Nov 17, 1:39 pm, Jason Dusek wrote:

> Thank you for your message. It's great to see a blow-by-blow
> description of transaction processing with GT.M.

> How are transactions identified? Are they given UUIDs by each
> GT.M node that receives them? Are transactions ever "too old"
> such that they need be ignored? Are there scenarios where a
> transaction must be aborted in the event of network outage or
> delay?

[KSB] On the node where they are computed and committed, transactions are simply identified by a monotonically increasing 64-bit integer. Transactions never become so old that they need to be ignored - but if you delete older generation journal files, rollback and catchup operations may fail. Since LMS transactions are always computed and committed locally, a transaction is never aborted because of network delays (of course, a client may give up waiting for a transaction response, but once the transaction is received by the originating primary instance, there is zero network interaction).

Ideally, the duration between TStart and TCommit is small (to keep the probability of both real collisions and false positives low). A well-engineered application will interact with nothing but the database inside a transaction, and will aim to minimize the code that is executed inside a transaction.
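In M terms, that advice amounts to something like this sketch (hypothetical names again): interact with the outside world before TStart or after TCommit, and keep only database logic inside the bracket.

    SET acct="D12345"                           ; hypothetical account id
    READ !,"Amount: ",amt                       ; outside world: interact before the transaction
    TSTART ():SERIAL
    SET ^ACN(acct,"bal")=^ACN(acct,"bal")-amt   ; only database work inside the bracket
    TCOMMIT
    WRITE !,"New balance: ",^ACN(acct,"bal"),!  ; report the result after the commit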

Regards -- Bhaskar


From: Kunthar
Date: Wed, 18 Nov 2009 03:54:24 +0200
Topic: Re: GT.M Logical Multi Site (LMS) and eventual Consistency [warning, long]

Here are my questions: 1. How far apart are those geographically separated instances? New York, Paris, Hong Kong, Berlin, Singapore and Sydney, for example.

2. Have you got any real-world installations? How large are they?

3. What are the overall insert, update and get times with this kind of geographically separated instances?

4. The basic LMS model has one instance at which all business logic is executed. Which node will answer the requests? Only a single node as the entry point?

If yes, how could you eliminate internet-centric latency problems? For example, a request made from Sydney when the main primary instance is in Berlin.

Thank you,


From: "K.S. Bhaskar"
Date: Tue, 17 Nov 2009 20:03:46 -0800 (PST)
Topic: Re: GT.M Logical Multi Site (LMS) and eventual Consistency [warning, long]

On Nov 17, 8:54 pm, Kunthar wrote:

> Here are my questions:
> 1. How far apart are those geographically separated instances?
> New York, Paris, Hong Kong, Berlin, Singapore and Sydney, for example.

[KSB] The largest separation I personally know of is around 1000 miles. I have been told that there is replication between Hawaii and a data center on the US mainland, but I cannot confirm this. Since replication only requires a TCP connection, if you can set up a TCP connection, you can replicate over it.

> 2. Have you got any real-world installations? How large are they?

[KSB] GT.M has been in production since 1986 (and under continuous development since before that). It is currently the system of record for tens of millions of bank balances around the world (see http://fis-profile - it is deployed on GT.M). Have you visited the GT.M web site (http://fis-gtm.com)? It has the names of several users who have agreed to be featured. Look at http://www.fidelityinfoservices.com/FNFIS/Markets/NonFinancialIndustr... for an example of a large bank. The largest production sites have databases in the hundreds of GB to a few TB.

> 3. What are the overall insert, update and get times with this kind of
> geographically separated instances?

[KSB] Remember that transactions are computed and committed locally on originating primary instances and streamed asynchronously to replicating secondary instances, so the distance between instances doesn't matter. As for performance, quoting from http://www.fidelityinfoservices.com/FNFIS/NewsRoom/20061011.htm

"The FIS Profile real-time core banking solution was benchmarked at more than 3,000 on-line banking transactions per second on a 5 million account database and 2,600 transactions per second on a 50 million account database. Profile also processed 50 million interest accruals and balance accumulation updates at 13,224 transactions per second (48 million per hour)."

By the way, this was three years ago. Since then, we have benchmarked substantially better numbers, but those more recent benchmarks have not been published.

> 4. The basic LMS model has one instance at which all business logic is
> executed. Which node will answer the requests? Only a single node as
> the entry point?

[KSB] Yes, whichever node is the designated originating primary instance.

> If yes, how could you eliminate internet-centric latency problems? For
> example, a request made from Sydney when the main primary instance is
> in Berlin.

[KSB] For a client in Sydney served by an application in Berlin, there will be an additional delay (likely in the many tens to small hundreds of milliseconds) on the round trip. When we benchmark throughput, we measure latency from receipt of the message containing a transaction at the network interface of the server to the time a confirmation message is sent from the network interface of the server.

Regards -- Bhaskar