When we started down the road of building a next-generation platform at TT just six years ago, we had a vision to bring forward new-to-industry technologies that would provide better tools, new offerings and efficiencies to our clients. We had high expectations to deliver on those goals. Many of us on the leadership team here have prior experience working at trading firms, as well as other vendors, and we know the demands of our sophisticated clients and the expectations of traders.
Since the beginning of this platform’s life in production, we have made every effort to listen to our clients, traders and partners in implementing that vision, and we began the significant task of migrating our customer base. Unfortunately, with Tuesday’s outage we fell short of everyone’s expectations. We understand apologies and explanations can only go so far. But it is important that we share with you our plans to close the gaps and that we be transparent as we work to rebuild your trust.
Actions and delivering on our promises will be the ultimate proof.
To be clear, the migration of firms and end-users from X_TRADER® to the TT platform is happening in close partnership with our clients at the firm level, and implementation goals have been set. But we are not forcing any trader to move to the TT platform until there is confidence in its stability. We are working closely with our clients on timelines, but there is no edict from us on a mandatory turn-off date for X_TRADER. And while we have communicated sunset dates, those assumed that the TT platform would reach and prove a level of stability that meets the high expectations of our global clientele.
To help you follow the timeline below, we first need to explain two important components of the TT platform, as well as how we do load testing.
What is Zookeeper and how is it used within TT?
Zookeeper is an open source project from the Apache Software Foundation. It is used in distributed systems as a real-time directory of running services and configuration information.
In the case of TT, we use Zookeeper as this directory allowing different types of components to know about the state of one another. For example, the Edge server which the trading screen connects to as well as the pre-trade risk service both know if order routing is available by seeing the Order Gateway’s state in Zookeeper. Within the Order Gateway cluster for each market, Zookeeper is used for the different instances of the gateways to know about one another and manage automatic failover (e.g., if one Gateway were to die, the others would know because it would be removed from the directory and thus they would know to take over the exchange sessions the dead gateway was servicing).
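As a rough illustration of the directory idea, the following Python sketch models the registry in memory. It does not use the Zookeeper API and is not TT's implementation; the class `ServiceDirectory`, its methods, and the gateway and session names are all hypothetical. In a real cluster the surviving gateways would react to the directory change themselves, whereas this sketch hands out the orphaned sessions centrally to keep things short.

```python
class ServiceDirectory:
    """In-memory stand-in for the directory role described above (not Zookeeper)."""

    def __init__(self):
        self._live = {}  # gateway name -> set of exchange sessions it services

    def register(self, gateway, sessions=()):
        self._live[gateway] = set(sessions)

    def is_available(self, gateway):
        # What an Edge server or risk service would check before routing.
        return gateway in self._live

    def sessions_of(self, gateway):
        return set(self._live.get(gateway, ()))

    def remove(self, gateway):
        """Drop a dead gateway and spread its sessions over the survivors."""
        orphaned = sorted(self._live.pop(gateway, set()))
        survivors = sorted(self._live)
        for i, session in enumerate(orphaned):
            if survivors:
                self._live[survivors[i % len(survivors)]].add(session)


directory = ServiceDirectory()
directory.register("gw-1", {"session-A", "session-B"})
directory.register("gw-2", {"session-C"})
directory.remove("gw-1")  # gw-1 dies and drops out of the directory
```

After the removal, `gw-2` is the only entry left and services all three sessions, which is the failover behavior the paragraph above describes.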
Relevant to the outage on August 13, within TT, Zookeepers are deployed on the same physical servers as another service named UMP.
What is UMP and why does TT use it?
Ultra Messaging Persistence, a/k/a UMP, is a middleware product sold by Informatica (formerly 29West). TT uses UMP for guaranteed delivery of a subset of messages on our internal multicast network. In TT, since Order Gateways are the source of execution reports from the exchanges, only they are “senders” on UMP. Each time an Order Gateway broadcasts a message that must be “guaranteed,” it sends it via multicast and also writes it to the UMP server over a direct TCP connection. If a receiver sees that it has missed a message from a sender, it can go to the UMP server to retrieve it.
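The send-and-recover pattern described above can be sketched in a few lines of Python. This is a simplified model, not the UMP API: the classes `Store`, `Sender` and `Receiver`, the per-sender sequence numbers used to detect a gap, and the `drop_for` argument that simulates multicast loss are all assumptions made for illustration.

```python
class Store:
    """Stand-in for the UMP server: keeps each guaranteed message by sequence number."""

    def __init__(self):
        self._messages = {}

    def write(self, seq, payload):
        self._messages[seq] = payload

    def fetch(self, seq):
        return self._messages[seq]


class Receiver:
    def __init__(self, store):
        self.store = store
        self.next_seq = 1
        self.delivered = []

    def on_multicast(self, seq, payload):
        # A jump in sequence numbers means a message was missed: fetch it from the store.
        while self.next_seq < seq:
            self.delivered.append(self.store.fetch(self.next_seq))
            self.next_seq += 1
        if seq == self.next_seq:
            self.delivered.append(payload)
            self.next_seq += 1


class Sender:
    def __init__(self, store, receivers):
        self.store = store
        self.receivers = receivers
        self.seq = 0

    def send(self, payload, drop_for=()):
        self.seq += 1
        self.store.write(self.seq, payload)  # models the direct TCP write
        for receiver in self.receivers:
            if receiver not in drop_for:  # models multicast loss for some receivers
                receiver.on_multicast(self.seq, payload)


store = Store()
rx = Receiver(store)
tx = Sender(store, [rx])
tx.send("fill 1")
tx.send("fill 2", drop_for=[rx])  # rx never sees this one on multicast
tx.send("fill 3")                 # rx notices the gap and recovers "fill 2"
```

In this model the receiver ends up with all three messages in order, because the second one is recovered from the store when the third arrives.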
How have we been doing load testing to date?
Historically, our load testing has been two-pronged, with a dedicated environment for each prong.
We do component-specific load tests, isolating the component being tested, to identify throughput/capacity limits and identify breaking points. These tests are automated so that they are run multiple times a week on new versions of code.
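To make the idea of finding a breaking point concrete, here is a minimal Python sketch of a stepped load test. It is not TT's test harness: `find_breaking_point`, the latency budget, and the `fake_component` latency model are hypothetical and exist only to show the shape of such a test.

```python
def find_breaking_point(send_at_rate, rates, max_latency_ms):
    """Step through increasing rates and report where the latency budget is first exceeded.

    Returns (last_rate_within_budget, first_rate_over_budget); either may be None.
    """
    last_ok = None
    for rate in rates:
        latency_ms = send_at_rate(rate)
        if latency_ms > max_latency_ms:
            return last_ok, rate
        last_ok = rate
    return last_ok, None


def fake_component(rate):
    # Hypothetical model: flat latency up to 5000 msg/s, then steep growth.
    return 2.0 if rate <= 5000 else 2.0 + (rate - 5000) * 0.05


result = find_breaking_point(fake_component, [1000, 2000, 4000, 8000], max_latency_ms=50)
```

With this made-up model the component stays within budget at 4000 msg/s and exceeds it at 8000 msg/s, so a follow-up run would narrow the search between those two rates.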
We also do broader cross-component load tests; some regularly, but most on an ad hoc basis. For example, our compliance with MiFID II requires that we execute a set of defined stressed and disorderly market tests based on the regulatory requirements. These are run daily. In addition, depending on what components are being released/changed or possibly in response to performance issues observed in internal test environments or by customers in production, test suites are designed to stress the system in unique ways and are then executed in this “stability” environment. One of the major lessons learned from the last few months including what occurred on August 13 is that this ad hoc process needs to be more rigorous and structured than it has been to date.
Timeline for August 13
8:32 AM CT
The incident began when customers saw a “toast” message in the UI stating “Order routing is currently unavailable for CME” as we observed that Order Gateways supporting CME connectivity were restarting.
TT Support immediately realized the potential severity of the issue, and within two minutes a bridge line was opened that included our head of engineering, CTO, CIO and head of site reliability engineering.
Upon inspecting logs on the gateways, the cause of the restart was identified as the loss of connectivity to Zookeeper.
Order Gateways restart when losing connection to Zookeeper for more than 30 seconds because losing that connection is equivalent to being removed from the cluster. The reverse is also true; if Zookeeper misses heartbeats from an Order Gateway, it will kick that instance out of the cluster to force a failover. A restarted Order Gateway will re-register with Zookeeper once it is online, thus rejoining the cluster.
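The restart rule can be sketched as a simple timeout check. The 30-second figure comes from the description above; everything else (the `GatewaySession` class, its methods, and the use of plain numbers for time) is a hypothetical simplification rather than TT's or Zookeeper's actual code.

```python
SESSION_TIMEOUT_SECONDS = 30  # from the description above


class GatewaySession:
    """Tracks one gateway's contact with the directory."""

    def __init__(self, now):
        self.last_contact = now
        self.in_cluster = True

    def on_heartbeat(self, now):
        self.last_contact = now

    def check(self, now):
        """Return "restart" once contact has been lost for more than the timeout."""
        if now - self.last_contact > SESSION_TIMEOUT_SECONDS:
            self.in_cluster = False  # equivalent to being removed from the cluster
            return "restart"
        return "ok"

    def rejoin(self, now):
        # A restarted gateway re-registers and is part of the cluster again.
        self.last_contact = now
        self.in_cluster = True


session = GatewaySession(now=0)
session.on_heartbeat(now=10)
at_40 = session.check(now=40)  # exactly 30 seconds of silence
at_41 = session.check(now=41)  # more than 30 seconds of silence
```

Here `at_40` is `"ok"` and `at_41` is `"restart"`, since the rule as stated applies only when the gap exceeds 30 seconds.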
8:42 AM CT
After 10 minutes, the Order Gateways had not fully initialized and the decision was made to restart them one more time to ensure they were initializing cleanly given the fact they all had lost Zookeeper connectivity simultaneously.
Upon further investigation, the logs showed that the Zookeeper cluster itself had restarted. The cause was that the leader instance was timing out while communicating with the two follower instances.
As the Order Gateways came back online and appeared stable, the TT team split the group across two tasks: (1) actively monitoring the Zookeeper instances and Order Gateways and (2) deeper investigation into what caused the issues with the Zookeeper cluster.
9:12 AM CT
A single Order Gateway restarted automatically again and the team began to work on a contingency plan to address potential ongoing Zookeeper issues.
9:24 AM CT
The decision was made to do a rolling restart of the three Zookeeper instances in the CME data center. This action has been taken many times before in both internal and production environments without incident. Unfortunately, this time was different. When the first Zookeeper instance restarted, one of the four CME Order Gateways restarted unexpectedly, again complaining about losing connectivity to Zookeeper.
Subsequently over the next few minutes, another two Order Gateways restarted, though this time the cause identified in the logs was slowness to write to UMP.
Order Gateways restart when repeatedly failing to write to UMP because losing that connection means execution report delivery cannot be guaranteed.
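A minimal sketch of this kind of rule, assuming it is based on consecutive failures: the actual threshold and logic are not stated here, so the `UmpWriteMonitor` class and its default of three failures are purely hypothetical.

```python
class UmpWriteMonitor:
    """Signals a restart after too many consecutive failed writes."""

    def __init__(self, max_consecutive_failures=3):  # hypothetical threshold
        self.max_consecutive_failures = max_consecutive_failures
        self.failures = 0

    def record(self, write_succeeded):
        """Record one write attempt; return True when the gateway should restart."""
        if write_succeeded:
            self.failures = 0  # any success resets the count
            return False
        self.failures += 1
        return self.failures >= self.max_consecutive_failures


monitor = UmpWriteMonitor()
outcomes = [monitor.record(ok) for ok in (False, False, True, False, False, False)]
```

Only the last attempt in this sequence triggers a restart, because the successful write in the middle resets the failure count.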
At the same time that we were seeing Order Gateways struggling to maintain connectivity to both critical services (UMP and Zookeeper) that are deployed on the same physical servers, two other observations were made.
- It was observed that disk I/O was higher than normal on these servers; not so high that it was fundamentally broken, yet clearly a tipping point was being neared or breached.
- The nature of the original Zookeeper failures was being better understood by engineers after more analysis, and the theory was forming that slow writes on the “follower” nodes were what caused the “leader” node to remove them from the cluster and restart.
Given these two observations, the decision was made to separate these two services onto separate physical servers.
9:58 AM CT
We deployed the first new Zookeeper node on a new server. When this came online, one of the Order Gateways restarted again, exhibiting the same previously unseen behavior as when the rolling restart was attempted.
10:14 AM CT
The second new Zookeeper node came online on a new server and one of the Order Gateways restarted again, exhibiting the same previously unseen behavior as when the rolling restart was attempted.
10:16 AM CT
When we shut down the final Zookeeper node so that it could be moved to a new host, all Order Gateways restarted yet again.
At this time, a second previously unseen behavior with Zookeeper was observed where the two “follower” nodes in the cluster were not taking over leadership when the “leader” node was shut down. This has been tested countless times in internal environments at TT without issue but in this case, without a “Zookeeper leader,” the Order Gateways restarted.
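A little arithmetic shows why this was unexpected. Zookeeper's default rule is that a strict majority of the ensemble must be reachable for a leader to be elected. The function names below are hypothetical, and the sketch only restates that rule; it says nothing about what actually prevented the election on the day.

```python
def quorum_size(ensemble_size):
    """Smallest number of nodes that forms a strict majority."""
    return ensemble_size // 2 + 1


def has_quorum(reachable_nodes, ensemble_size):
    return reachable_nodes >= quorum_size(ensemble_size)


# Three-node ensemble with the leader shut down: two followers remain.
followers_can_elect = has_quorum(reachable_nodes=2, ensemble_size=3)
# A single isolated node cannot act as leader on its own.
lone_node_can_lead = has_quorum(reachable_nodes=1, ensemble_size=3)
```

With three nodes the quorum is two, so the two remaining followers should in principle have been able to elect a new leader between them.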
10:18 AM CT
We brought the Zookeeper leader back up on its original server and service stabilized.
Root cause summary
Resource contention between UMP and Zookeeper services deployed on shared physical servers caused two issues resulting in the same basic behavior. First, Zookeepers slowed down as a result of the contention and (a) lost connectivity with one another resulting in a gateway restart and (b) lost connectivity to individual Order Gateways. Second, UMP slowed down, causing a disconnect from Order Gateways.
Knock-on issues post-outage
Unfortunately, the repeated restarts of Order Gateways triggered two defects that were already known and fixed in our UAT environment. Both fixes are awaiting production deployment, which will happen soon. These two defects can only be triggered by a restart, and they result in execution reports from the exchange being missed for some accounts/users.
Later in the day it was observed that some synthetic orders (i.e., algo and TT synthetic order types) that were cancelled after the Zookeeper instances were moved to new physical servers were not properly cleaned up, resulting in some algos continuing to run.
Short-term actions
- Beginning this weekend and over the next few weeks, Zookeeper services will be moved to separate hosts from UMP.
- Beginning this weekend and over the next few weeks, all UMP hosts will be switched to solid state drives. These storage devices handle far more I/O than traditional spinning disk drives, so the switch will increase the servers’ disk I/O capacity.
- We are actively working internally to reproduce the two Zookeeper-related defects where (a) the Zookeeper leader responsibility was not properly transferred to a follower on a restart and (b) Order Gateways lost connection to the entire Zookeeper cluster when only a single node was restarted.
- We will be making a change in Algo servers to better handle real-time changes to the Zookeeper cluster to ensure proper cleanup of synthetic orders when cancelled after a change is made.
- We will enhance the size and scope of our scalability test environment and invest the necessary engineering resources to enable automation of a base set of cross-component load tests that run weekly across multiple versions of code.
Medium/long-term actions
We are working on a roadmap to share directly with clients. It will highlight the series of changes we will be making to our scale/load testing procedures, as well as any design or architectural changes needed to address the instability experienced in the last few months.