In the months since we successfully delivered realtime data for the Australian Open to millions of fans, we have been talking with many of our customers, and with technical teams at prospective customers in the betting industry, about the lessons we learned and how they can apply them to their own architectures.
As a firm believer in, and contributor to, open source, I explore in this article some common realtime problems we see at betting companies, and share some proposed architectural design patterns that I hope will be useful to other developers.
Problem 1: Theoretically limitless and often unpredictable number of live update recipients
Most often a stream of events arrives at the app developer’s servers from a sports data provider (such as Opta or Sportradar), some business logic is applied, and the data then needs to be pushed to every participant's device.
For example, when a goal is scored, you may want every user of your service to receive an update within 200 milliseconds anywhere on earth. The challenge is that you want that target latency to be met whether you have just one or millions of users, without changing your architecture as the volume increases.
Pattern #1 — Pub/Sub
Using an event-driven architecture built on the pub/sub pattern is hardly new: it has been around since 1987, and it is a good way to approach asynchronous realtime data distribution. The pattern involves two roles: Pub, the publisher, and Sub, the subscriber. The publisher (typically your server) publishes messages without any knowledge of how many subscribers there are, and subscribers register to receive messages without any knowledge of the publishers. The two roles are intentionally decoupled, with a middleware broker responsible for receiving messages and fanning them out to the subscribers.
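The decoupling the pattern describes can be sketched in a few lines of Python. The broker below is a purely illustrative, in-memory stand-in for a real middleware service; the channel name and event shape are invented for the example.

```python
from collections import defaultdict

class Broker:
    """A minimal in-memory pub/sub broker (illustrative only)."""
    def __init__(self):
        self._subscribers = defaultdict(list)  # channel name -> callbacks

    def subscribe(self, channel, callback):
        # Subscribers register with no knowledge of who publishes.
        self._subscribers[channel].append(callback)

    def publish(self, channel, message):
        # The publisher has no knowledge of how many subscribers exist;
        # the broker fans the message out to all of them.
        for callback in self._subscribers[channel]:
            callback(message)

# Usage: two independent subscribers receive the same goal event.
broker = Broker()
received = []
broker.subscribe("match:scores", lambda m: received.append(("ui", m)))
broker.subscribe("match:scores", lambda m: received.append(("store", m)))
broker.publish("match:scores", {"event": "goal", "player": "bob"})
```

Adding a third subscriber would require no change to the publisher, which is exactly the property that lets the system scale with audience size.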
How does this help?
Simplicity: It ensures your apps scale with the number of subscribers without any changes to the design of your system. The broker is responsible for scale, and if you choose to adopt a middleware pub/sub solution it should address that need in a predictable way.
Our customers use Ably’s Pub/Sub feature as a middleware broker to provide limitless scale. Ably is elastic by design: elasticity is part of the Four Pillars of Dependability at the heart of our system. We offer a transparent, mathematically grounded design for extreme scale, elasticity, and service uptime, with a 50% capacity margin for instant surges and a 99.999% uptime SLA. Ably’s global roundtrip latencies are <65ms at the 99th percentile. Developers trust us to look after the scale issues.
Problem 2: Data synchronization
Problem 1 describes how data is distributed to devices, but it does not address how you keep data in sync consistently across all devices.
For example, your app may need to maintain live prices. As each event occurs during the game, your app needs to reflect that change in real time, both in the UI and in local storage. The challenge is one of data integrity and bandwidth. If you publish the entire price set for every line, it could be hugely inefficient and will place a significant bandwidth load on your users’ devices. Perhaps more importantly, this could impair the user experience for people on slower connections or with expensive bandwidth. If, however, you only send data updates, how do you ensure that data integrity is maintained, i.e. that all updates arrive reliably and in order?
Pattern #2.1 — Serial JSON Patches
JSON Patch is a standard that defines how you can send only the deltas for a JSON object as it mutates. For example, if you had a table of all players with their stats, and only one player’s stats changed following a goal being scored, then the patch may look something like:
[ { "op": "replace", "path": "/player/bob/goals", "value": 1 } ]
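As a sketch of how such a patch is consumed, the minimal applier below handles only the "replace" operation against nested dicts; a production implementation (or a library such as python-json-patch) would cover all six RFC 6902 operations and full JSON Pointer escaping.

```python
def apply_patch(doc, patch):
    """Apply a JSON Patch to a nested dict, in place.
    Illustrative sketch: only the "replace" op is supported."""
    for op in patch:
        if op["op"] != "replace":
            raise NotImplementedError(op["op"])
        # Walk the JSON Pointer path down to the parent of the target key.
        *parents, leaf = op["path"].strip("/").split("/")
        target = doc
        for key in parents:
            target = target[key]
        target[leaf] = op["value"]
    return doc

# Usage: only Bob's goal count changes; the rest of the table is untouched.
stats = {"player": {"bob": {"goals": 0}, "alice": {"goals": 2}}}
apply_patch(stats, [{"op": "replace", "path": "/player/bob/goals", "value": 1}])
```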
How does this help?
JSON Patch provides a means to efficiently send deltas for a JSON object thus reducing bandwidth overhead significantly. However, JSON Patch does not provide the complete solution as:
- You need to obtain the JSON object at the outset
- The JSON Patches must be applied in exactly the order they were generated; a missing or out-of-order patch will result in complete loss of data integrity
Ably offers reliable delivery: data integrity is another of the Four Pillars of Dependability at the heart of our system, and message ordering and delivery are guaranteed from publisher to subscribers.
We also provide a message history (replay) feature, providing a means to obtain historical messages published on the channel prior to connecting. Finally, we uniquely offer continuous history, ensuring developers can reliably obtain history and then receive subsequent realtime updates without any missing or duplicated messages.
A pattern we’ve seen developers use with Ably to solve this problem therefore is:
- Configure messages to be persisted
- Publish the original JSON object on the channel, and then subsequently all JSON Patches
- When connecting, clients obtain the channel history and subscribe to future JSON Patches. The history provides a means to rebuild the JSON object from the initial object plus all the patches, and the attached channel ensures live updates continue to be received in order and with integrity.
- If a client loses continuity on the channel (this may happen if the client is disconnected for more than two minutes), the app simply repeats the previous step.
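The client-side steps above can be sketched as follows. This is an illustrative, in-memory model: `channel_history` stands in for the persisted messages returned by a history query (the initial object followed by every patch published since), `live_patches` stands in for patches received after attaching, and the continuity guarantee between the two is assumed to be provided by the transport.

```python
import copy

def apply_patch(doc, patch):
    # Minimal "replace"-only patch applier; see RFC 6902 for the full op set.
    for op in patch:
        *parents, leaf = op["path"].strip("/").split("/")
        target = doc
        for key in parents:
            target = target[key]
        target[leaf] = op["value"]
    return doc

def resync(channel_history, live_patches):
    """Rebuild local state: the first history message is the full object
    snapshot, everything after it is a JSON Patch applied in order."""
    snapshot, *patches = channel_history
    state = copy.deepcopy(snapshot)  # keep the stored snapshot pristine
    for patch in patches + live_patches:
        apply_patch(state, patch)
    return state

# Hypothetical sequence: initial object, one persisted patch, one live patch.
history = [
    {"player": {"bob": {"goals": 0}, "alice": {"goals": 0}}},
    [{"op": "replace", "path": "/player/bob/goals", "value": 1}],
]
live = [[{"op": "replace", "path": "/player/alice/goals", "value": 1}]]
state = resync(history, live)
```

On loss of channel continuity, the recovery step in the list above amounts to calling `resync` again with a fresh history query.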
Note: At Ably, we are driving forward the development of an open standard Open-SDSP (Open Streaming Data Sync Protocol) to help solve these types of synchronization issues. I have previously written thoughts on why this is needed, and how this open standard could benefit the industry.
Pattern #2.2 — CRDT
A CRDT is a conflict-free replicated data type. Unlike JSON Patch, it allows multiple parties to concurrently update the underlying data object. Each update is then distributed to all other parties, and the algorithm ensures that once all updates are applied by all participating parties, the underlying data object will become eventually consistent, regardless of the order the updates are applied.
CRDTs offer a sophisticated way to ensure data is consistent, even when multiple parties change the data at the same time. However, in order to provide eventual consistency guarantees, CRDTs support only a limited set of data types, each with specific restrictions.
To date, CRDTs are more commonly used in collaborative editing applications where multiple users can update the content simultaneously. Riak is one of the few database solutions that provides native CRDT support. I will leave it to the reader to consider whether CRDT is appropriate for their use case and how best to implement it.
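As a concrete example of the restricted data types involved, the grow-only counter (G-Counter) below is one of the simplest CRDTs: each replica increments only its own slot, and merging takes the per-replica maximum, so all replicas converge on the same total regardless of the order in which merges are applied. This is an illustrative sketch, not a library implementation.

```python
class GCounter:
    """Grow-only counter CRDT (sketch)."""
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica id -> that replica's own count

    def increment(self, n=1):
        # A replica may only ever increment its own slot.
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n

    def merge(self, other):
        # Per-replica max: commutative, associative, and idempotent,
        # which is what makes merge order irrelevant.
        for rid, n in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), n)

    def value(self):
        return sum(self.counts.values())

# Two replicas update concurrently, then exchange state in either order.
a, b = GCounter("a"), GCounter("b")
a.increment(2)
b.increment(3)
a.merge(b)
b.merge(a)
# Both replicas now agree on the total of 5.
```

The restriction is visible in the design: a G-Counter can only grow, and supporting decrements or arbitrary values requires progressively more elaborate CRDT constructions.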
Problem 3: Upstream gameplay events
Problems 1 and 2 address issues associated with scaling downstream data to your devices. However, betting games often have an upstream component too: users on devices may place bets, participate in a chat, or collaborate with team members.
Typically this is handled with a simple HTTP request to your servers which in turn run some business logic which may respond synchronously (as part of the response to the HTTP request), or asynchronously (pushed back to the device later as a message).
The problems app developers face are:
- Using an HTTP request in a synchronous fashion increases the likelihood that the operation fails under changing network conditions. For as long as the client is waiting for a response, there is a chance the connection state will change and the underlying TCP connection for the HTTP request will be closed. If a request fails due to the TCP connection being closed, it may need to be retried, but unless the operation is idempotent, retrying could result in unexpected behaviour for users (such as placing two identical bets).
- If there is a sudden spike of activity, perhaps due to an unforeseen change in the match such as a player being taken off, then your servers will need to absorb that load immediately. You need to predict the load in advance of each game and ensure you have sufficient service capacity for the spikes.
- HTTP provides no ordering guarantees, i.e. a later request may arrive before an earlier one when the two are sent close together.
Pattern #3 — Message queues, serverless architecture, and streams
Traditional message queues are designed to provide a message-oriented means of communication between components of the system, in this instance the devices and your servers. Like the pub/sub middleware, the message queue acts as a broker distributing messages to each consumer. If your messaging middleware is scalable and resilient, then decoupling each component in the system ensures failure or overload in one area of the system will not affect any other part of the system.
Note that, unlike the pub/sub pattern where each message is delivered to all subscribers, message queues typically deliver each message only once, to one of the consumers (technically AMQP can only provide an at-least-once guarantee, but in almost all situations it provides exactly-once delivery). Queues typically operate FIFO (first-in-first-out), ensuring that even if a backlog builds up because consumers cannot keep up, the consumers will always be working on the oldest messages first.
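These delivery semantics can be illustrated with a single-process sketch using Python's standard queue module; the round-robin consumer loop below is a deterministic stand-in for competing workers pulling from a real broker, and the message shape is invented.

```python
import queue

q = queue.Queue()  # FIFO: the oldest message is always consumed first

# Devices enqueue six bet messages.
for i in range(6):
    q.put({"bet_id": i})

# Two competing consumers: unlike pub/sub, each message is delivered to
# exactly one of them, never to both.
consumed = {"worker-1": [], "worker-2": []}
workers = ["worker-1", "worker-2"]
n = 0
while not q.empty():
    msg = q.get()
    consumed[workers[n % 2]].append(msg["bet_id"])
    n += 1
```

Each worker sees a disjoint subset of the messages, in order, and between them they drain the whole backlog oldest-first.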
How does this help?
By decoupling publishers of data (your devices in this instance) from the consumers of the data (your servers), and introducing a message queue middleware broker, you are building a fault tolerant system. For example:
- Huge spikes of activity may slow down responsiveness for customers as the workers process the backlog of messages, but customers will not experience failures.
- Where the ordering of events from device to server is important, the message queue is able to provide ordering guarantees. Ably provides reliable ordering all the way through to the queue.
- Messages from devices are streamed immediately into the queue, and the device receives an ACK notifying it that the message has been received. As a result, if you use Ably for publishing, we can ensure a message is never published twice (i.e. if you publish a message, lose the connection before you receive an ACK, and then retry the message once reconnected, we de-dupe that message).
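One common way to achieve this kind of de-duplication is for the client to attach a unique id to each message and reuse that id on retry, with the broker dropping ids it has already seen. The sketch below only illustrates the idea; it is not Ably's internal mechanism, and the ids and payloads are invented.

```python
class DedupingBroker:
    """Sketch of broker-side idempotent publishing."""
    def __init__(self):
        self.seen_ids = set()
        self.delivered = []

    def publish(self, message_id, payload):
        # A retried publish carries the same id, so it is silently dropped.
        if message_id in self.seen_ids:
            return "duplicate-ignored"
        self.seen_ids.add(message_id)
        self.delivered.append(payload)
        return "ack"

broker = DedupingBroker()
broker.publish("bet-42", {"stake": 10})
# The connection dropped before the ACK arrived, so the client retries
# with the same message id:
broker.publish("bet-42", {"stake": 10})
# The bet is placed exactly once.
```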
Note: Ably provides a number of services to ensure scale when processing upstream events:
- Reactor Serverless Functions allow a stream of events to trigger serverless functions on AWS, Google Cloud, or Azure, which provide the scale you need
- Reactor Message Queues provide a hosted AMQP/STOMP queue to consume realtime data
- Reactor Firehose allows realtime data to be streamed into any number of third-party streaming or queueing services such as AWS Kinesis, Kafka, AWS SQS, RabbitMQ etc.
Ably's Four Pillars of Dependability
We created the Four Pillars of Dependability because we understand that transferring data in realtime is vital for a seamless user experience.
Performance: We focus on predictability of latencies to provide certainty in uncertain operating conditions, offering <65ms roundtrip latency at the 99th percentile and unlimited channel throughput.
Integrity: We offer guarantees for ordering and delivery to overcome limitations of pub/sub & simplify app architecture. Message ordering and delivery are guaranteed from publisher to subscribers.
Reliability: Fault tolerance at regional and global levels, so we can survive multiple failures without outages (99.999999% message survivability in the face of instance and datacenter failure).
Availability: A transparent, mathematically grounded design for extreme scale, elasticity, and service uptime, with a 50% capacity margin for instant surges and a 99.999% uptime SLA.
In this article, I have touched on three common realtime challenges that we see for betting app developers. There are many more and I’d be happy to discuss them. Please get in touch if you’d like to discuss this article, your realtime challenges or have any other questions.
This article was first published on 10th April 2018, and was updated in January 2022 to add new links and information about our Four Pillars of Dependability.