Billions of people rely on WhatsApp each day to communicate in realtime. Friends exchange memes, expats catch up with their families, businesses take bookings and run customer support, and teams ranging from emergency services to on-call engineers at tech companies even use WhatsApp as their primary communication tool. So when WhatsApp had an hours-long global outage on 25 October 2022, the world noticed. WhatsApp’s recent hiccup wasn’t the first high-profile outage, and it won’t be the last.
WhatsApp, and live chat more generally, is one of the most visible ways that realtime experiences have become an integral part of our everyday lives. We order food and track its status in realtime, watching as riders inch closer on the map. We receive instant notifications about every type of in-play event that happens across sports like tennis. And we increasingly collaborate with our colleagues in live documents and whiteboards.
In a post-pandemic world, we rely on these realtime experiences to connect and collaborate with each other on a daily basis across work, study and play. The costs of downtime, both financially and to a brand’s reputation, rise as society’s reliance on realtime communications becomes more critical. But making them reliable is no easy task.
WhatsApp’s $22bn realtime data architecture
A leading factor in Facebook’s (now Meta) 2014 acquisition of WhatsApp was its industry-leading architecture, which could power the reliable, low-latency delivery of realtime data (in this case, instant chat messages) even in locations with poor internet connectivity.
WhatsApp is famous for having a small team. It had just 55 people at the time of the acquisition, a team largely focused on making sure the globally distributed architecture that underpinned WhatsApp was dependable and translated into a seamless user experience.
Despite having access to some of the brightest engineering minds and almost unlimited resources, Facebook Messenger still couldn’t compete with WhatsApp’s user experience. WhatsApp won hands down every time when it came to message delivery: it was instant and messages always arrived in the order they were meant to, no matter if the sender or recipient had dropped off the internet during sending/receiving.
WhatsApp’s reliability, coupled with its viral adoption, forced Facebook to acquire the messaging app for the eye-watering sum of $22bn. Not every company can head to the war chest and buy its way to a reliable realtime architecture, and indeed many don’t.
Considerations for delivering realtime experiences
When you decide to deliver realtime experiences to your users, you need to make sure you can meet today’s expectations and also the higher demands that will undoubtedly come in the near future. As consumers and business app users, we’re spoiled by realtime experiences that simply work, and anything other than a seamless experience is unacceptable. Many organizations think they can simply allocate “adequate” headcount to meet this challenge. But if Facebook couldn’t compete with all of its resources and expertise, what hope do other engineering teams have?
Perhaps the answer isn’t building in-house expertise in globally-distributed, massively dependable realtime systems. It is possible to leverage an infrastructure that provides extraordinary reliability today and that has the capability of scaling to handle whatever realtime communication needs the future may hold.
Realtime scenarios to account for
Whether you decide to build or buy a solution, it is essential for reliability that your infrastructure is designed to handle any given scenario. Here are a few cases to consider.
- Intermittent or poor quality networks: Applications like WhatsApp are accessed over the internet, which itself is an engineering challenge. Development teams must consider how their application handles connection drops, and then how they enable users to pick up from where they left off with minimal disruption. Any missed messages need to be delivered without duplicating those already processed.
- Supporting globally distributed users: People interact with WhatsApp from all over the globe, which forces developers to think about how they deliver a consistent experience, with minimal latency, to users regardless of their location. In chat-based applications, any delay in receiving messages will ultimately ruin the experience, as users rely on these apps for instant communication. To architect for this, developers need to consider having geographic proximity to users through multiple Points of Presence.
- Ensuring stability and reliability: As mentioned above, users will not accept any downtime, so it is critical that there is no single point of congestion and no single point of failure. Developers should architect with intentional redundancy in order to handle failover scenarios where one or more components fail, as well as unexpected spikes in messages. Data replication to other regions is also important to optimize message delivery (see Supporting globally distributed users) and to support full region failover and continuity of service in even extreme scenarios.
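The first scenario above, picking up after a dropped connection without losing or duplicating messages, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of how WhatsApp or Ably implement it: `ChannelLog`, `Client` and the per-channel `serial` counter are hypothetical names invented for this example, and the server-side log simply lives in memory.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    serial: int  # position in the channel, assigned by the (hypothetical) server
    body: str


@dataclass
class ChannelLog:
    """Hypothetical server-side log that keeps messages in send order."""
    messages: list = field(default_factory=list)

    def publish(self, body: str) -> Message:
        msg = Message(serial=len(self.messages) + 1, body=body)
        self.messages.append(msg)
        return msg

    def since(self, serial: int) -> list:
        # Serials start at 1 and are contiguous, so slicing by serial
        # returns exactly the messages published after `serial`.
        return self.messages[serial:]


@dataclass
class Client:
    """Hypothetical client that remembers the last serial it processed."""
    last_serial: int = 0
    inbox: list = field(default_factory=list)

    def receive(self, msg: Message) -> bool:
        # Ignore anything already processed, e.g. a redelivery after reconnect.
        if msg.serial <= self.last_serial:
            return False
        self.inbox.append(msg.body)
        self.last_serial = msg.serial
        return True

    def resume(self, log: ChannelLog) -> int:
        # On reconnect, ask only for what was missed while offline.
        return sum(self.receive(m) for m in log.since(self.last_serial))


log = ChannelLog()
client = Client()
client.receive(log.publish("hi"))

# The connection drops here; two messages are published in the meantime.
missed = [log.publish("are you there?"), log.publish("call me")]

delivered = client.resume(log)                   # catches up on both
duplicate_accepted = client.receive(missed[0])   # a redelivery is ignored
```

The design choice worth noting is that the client, not the server, tracks its own position. After any interruption it can state where it left off, and a repeated delivery is harmless because anything at or below `last_serial` is dropped. A production system would also need to persist the log, expire old messages and handle gaps in serials, none of which this sketch attempts.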
Avoid outages with a solution built by realtime experts
Architecting for these challenges requires experts in realtime systems and, as the outage at Facebook-owned WhatsApp has shown, outages can affect any organization, no matter how large. Ably has been designed from the ground up to be available and reliable at global levels, with hard guarantees for performance and data integrity.
If you’d like to give Ably a go, create a free Ably account to get started.