There is an entire category of problems in distributed computing that require a single client amongst a set of peers in a network to coordinate the behavior of all the other clients. It is common in systems that process queues of work and connect to downstream systems that don’t support concurrent connections, or in systems where work is parallelized in a specific way.
For example:
- A system that monitors a file location over SFTP might have a single monitoring node that downloads a large file, then distributes a subset of rows in that file to other nodes for parallel processing.
- Systems that hold open a direct socket connection for work to be streamed down, which is then distributed or queued to other clients for processing.
- Anything that requires a single connection to a server, or other resource, to be maintained.
This pattern, where one peer is the ‘coordinator’, is popular in video games, where it’s often prohibitively expensive to run multiplayer servers. It can be far cheaper to provide a single matchmaking server for clients to discover each other and then host the multiplayer game logic in one of the peers.
The weakness of these architectures shows when the coordinating client (the leader) dies and cannot recover. Without the coordinating client, the entire system will stop processing messages and fall into a “zombie” state.
A similar behavioral pattern exists in peer-to-peer networks, where specific peers are responsible for distributing state across the network to inform the other clients how to behave, or for access to data or functionality in the network of clients.
Leader election to the rescue
A popular strategy to make peer-to-peer networks reliable is to include all the client and coordinating code in every client in the network. This means that every instance of the running application is capable of running as the coordinating node.
Such a pattern is powerful because it makes scaling out a network very easy – just add more clients (or nodes) to the network and deploy the same software that’s running everywhere on the new ones.
To support this strategy, leader election is required on the network. Leader election is a design pattern that lets one of the peers be elected as the coordinating client and become responsible for sending control or orchestration messages across the network.
Leadership election enables the network of peers to understand when its coordinating client has left the network (gracefully or ungracefully), and select an alternative node to continue operations from where that client left off.
There are a number of ways to implement leader election in a network:
- Racing to acquire a distributed lock on a shared resource (like a database record, or a file that is accessible to the network like an S3/Azure Blob Storage bucket)
- Ranking clients by client Ids
- Implementing a message passing algorithm where peers pass election messages around each node in the network (see: Ring Algorithm)
Each approach has pros and cons. Like most architectural choices, the technique you pick will depend on the rest of your application architecture.
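To make the message-passing option more concrete, here is a minimal sketch of a ring-style election, simulated in a single process. It assumes numeric node ids and a fixed ring order; the function name electOnRing and the sample ids are hypothetical and not part of any library. The election message travels once around the ring collecting ids, and the highest id wins.

```typescript
// Hypothetical ring election: the election message visits every node once,
// collecting ids, and the initiator then picks the highest id as the leader.
function electOnRing(ring: number[], initiatorIndex: number): number {
  if (ring.length === 0 || initiatorIndex < 0 || initiatorIndex >= ring.length) {
    throw new Error("the ring must be non-empty and the initiator must be in it");
  }
  const collected: number[] = [];
  let index = initiatorIndex;
  // Pass the message to each successor until it is back at the initiator.
  do {
    collected.push(ring[index]);
    index = (index + 1) % ring.length;
  } while (index !== initiatorIndex);
  return Math.max(...collected);
}

const ring = [17, 4, 42, 8]; // node ids in ring order (illustrative values)
console.log(electOnRing(ring, 1)); // the same leader is chosen whichever node starts
```

A real implementation would also have to handle nodes that fail while the message is in flight, which this sketch ignores.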
If you’re dealing with file processing or similar systems, you will probably set up your clients to race to grab a distributed lock. This is often the default option and is simple to implement. Implement a thread/recurring task which is dedicated to locking a known file location. Any client that successfully manages to lock the file is elected the leader and starts coordinating. If that client dies, the lock is released, and another peer acquires the lock and continues.
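The lock race described above can be sketched as follows. This is an in-memory stand-in, not a real distributed lock: the LeaseLock class, the lease length, and the client names are all assumptions made for illustration. It models "the lock is released if the client dies" with a lease that expires unless the holder keeps renewing it.

```typescript
// Hypothetical in-memory stand-in for a shared lock (a database row, a blob lease, ...).
class LeaseLock {
  private holder: string | null = null;
  private expiresAt = 0;

  // Try to take or renew the lock. Succeeds if the lock is free, expired, or already ours.
  tryAcquire(clientId: string, now: number, leaseMs: number): boolean {
    const free = this.holder === null || now >= this.expiresAt;
    if (free || this.holder === clientId) {
      this.holder = clientId;
      this.expiresAt = now + leaseMs;
      return true;
    }
    return false;
  }
}

const lock = new LeaseLock();
const LEASE_MS = 10000;

// Two clients race at t=0: only the first call wins and becomes the leader.
const aLeads = lock.tryAcquire("client-a", 0, LEASE_MS);
const bLeads = lock.tryAcquire("client-b", 0, LEASE_MS);
// client-a dies and stops renewing; once the lease has expired, client-b takes over.
const bLeadsLater = lock.tryAcquire("client-b", 10000, LEASE_MS);
console.log({ aLeads, bLeads, bLeadsLater });
```

Against a real shared resource, the check and the write inside tryAcquire would need to be a single atomic operation (for example a conditional write), otherwise two clients could both believe they hold the lock.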
In browser-based peer networks, this approach is less viable. Browser applications are typically single-threaded and don’t have access to shared, writable storage locations. This makes client ranking a more natural fit in those environments, and it’s something that can be implemented with very little code using the Ably SDKs.
How to implement client ranking using Ably
In this example, we implement a very simple application where the leader (the coordinating node) is responsible for incrementing a counter, while all of the nodes, including the leader, are responsible for displaying the counter in the browser.
While not an application you’d ever sell, it is similar in architecture to peer-to-peer voting, peer-to-peer gaming, or any other application which begins when a single client is present, and ends when the last client leaves.
The demo - a counting app with leader election
In order to run the demo, you will need an Ably API key. If you are not already signed up, you can sign up now for a free Ably account. Once you have an Ably account:
- Log into your app dashboard.
- Under “Your apps”, click on “Manage app” for any app you wish to use for this tutorial, or create a new one with the “Create New App” button.
- Click on the “API Keys” tab.
- Copy the secret “API Key” value from your Root key. We will use this later when we build our app.
We use the Ably Presence API and TypeScript to build this demo. The Presence API notifies members of an Ably channel when a client joins or leaves the channel. Clients are able to set, read and update their presence data (a string that can be set to any value, and is used to store information about the client).
We create a class that wraps the Ably JavaScript SDK, and subscribes to the SDK’s presence events. If there is no leader already elected when a client joins or leaves the channel, the client with the lowest clientId is elected as the leader. Because Ably supplies the presence data, we know that even though this logic is executed on each client in the network, the ordering of clientIds is the same on each peer.
Let's start by defining a Swarm class in a file called Swarm.ts, and some member variables:
import { Types } from 'ably';
import Ably from 'ably/promises';
export default class Swarm {
  public readonly id: string;
  private readonly ably: Ably.Realtime;
  private readonly channel: Types.RealtimeChannelPromise;
We also store an id, an instance of the Ably SDK, and the channel we're connected to as properties on an object.
Next we define some properties to store callbacks. When connection and election events occur, we'll call the appropriate callback to allow the application to trigger an appropriate action:
  public onElection: (channel: Types.RealtimeChannelPromise) => void = () => {};
  public onConnection: (channel: Types.RealtimeChannelPromise) => void = () => {};
  public onSwarmPresenceChanged: (members: Types.PresenceMessage[]) => void = () => {};
The constructor takes a channelName to connect to, and generates an id for this client. In this implementation, we create a new Ably instance, and connect to the channelName which was given, passing a generated clientId.
Please note, in a real application you should use token authentication rather than using an API key directly in your code, for security reasons.
Once we've created an Ably instance, we're going to connect and store a channel instance to this.channel:
  constructor(channelName: string) {
    this.id = Math.floor(Math.random() * 1000000) + "";
    this.ably = new Ably.Realtime({ key: "api-key-here", clientId: this.id });
    this.channel = this.ably.channels.get(channelName);
  }
The connect method is going to subscribe to the presence events on the channel, passing the onPresenceChanged method as the handler to be called when an event occurs. Once subscribed, we're going to enter presence, with a state of "connected", and call the onConnection callback:
  public async connect() {
    this.channel.presence.subscribe(this.onPresenceChanged.bind(this));
    await this.channel.presence.enter('connected');
    this.onConnection(this.channel);
  }
The onPresenceChanged callback will be called when a client joins or leaves the channel. Whenever this happens, we retrieve the latest full presence set from the channel instance (which is cached), trigger the election process, and finally call the onSwarmPresenceChanged callback.
  private async onPresenceChanged() {
    const members = await this.channel.presence.get();
    // Wait for the election to finish before notifying the application.
    await this.ensureLeaderElected(members);
    this.onSwarmPresenceChanged(members);
  }
The election logic is implemented in ensureLeaderElected. We use the members array to determine if a client is the leader. First, it checks the members array passed to the function, and if there is already a leader (someone with the presence data of "leader"), it will stop and return early. There's already a leader, so nothing further needs to be done.
  private async ensureLeaderElected(members: Types.PresenceMessage[]) {
    const leader = members.find(s => s.data === "leader");
    if (leader) {
      return;
    }
...
If there is no leader, we sort the members array alphabetically, using the clientId property of each PresenceMessage:
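...
    // Copy before sorting so the caller's array is not reordered, and compare
    // clientIds as strings so every peer computes the same order.
    const sortedMembers = [...members].sort((a, b) =>
      a.clientId < b.clientId ? -1 : a.clientId > b.clientId ? 1 : 0
    );
    // Optional chaining guards against an empty presence set.
    if (sortedMembers[0]?.clientId === this.id) {
      await this.channel.presence.update("leader");
      this.onElection(this.channel);
    }
  }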
If the first member in the sorted members array is the current client, we update its presence data to "leader" and call the onElection callback. Updating the presence data will flag this client as the leader, and the calling code uses the onElection callback to trigger application logic appropriately.
Remember, this code is running across every client in the swarm, but only the first client when sorted by clientId will elect itself the leader.
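To illustrate why every peer reaches the same answer, here is a small standalone sketch of the ranking rule with no Ably calls. The Member interface, the pickLeader function, and the sample ids are hypothetical; Member stands in for the two PresenceMessage fields the election uses.

```typescript
// Minimal stand-in for the fields of a presence message used by the election.
interface Member {
  clientId: string;
  data: string;
}

// Returns the clientId that should lead: an existing leader keeps the role,
// otherwise the member whose clientId sorts first (string comparison) is chosen.
function pickLeader(members: Member[]): string | undefined {
  const existing = members.find((m) => m.data === "leader");
  if (existing) {
    return existing.clientId;
  }
  const sorted = [...members].sort((a, b) =>
    a.clientId < b.clientId ? -1 : a.clientId > b.clientId ? 1 : 0
  );
  return sorted[0]?.clientId;
}

// Every peer sees the same members, possibly in a different order.
const seenByPeerA: Member[] = [
  { clientId: "731002", data: "connected" },
  { clientId: "120455", data: "connected" },
  { clientId: "904410", data: "connected" },
];
const seenByPeerB = [...seenByPeerA].reverse();

// Both peers name the same client, so only that client promotes itself.
console.log(pickLeader(seenByPeerA), pickLeader(seenByPeerB));
```

Because the ids are compared as strings, "lowest" here means first in string order rather than numerically smallest. That is fine for election purposes as long as every peer uses the same comparison.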
How to build a distributed counting application using the Swarm class
This demo application uses ES Modules, with Vite as the development tooling. The application comprises four files:
./app/Swarm.ts
./app/index.html
./app/script.ts
./app/style.css
We've already worked through Swarm.ts, so next we will build out index.html:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Example</title>
<script src="./script.ts" type="module"></script>
</head>
...
Notice that the reference to ./script.ts uses type="module" in the script tag. ES Modules make it possible for the browser to import script.ts as a module in the front end. (This means that we can use browser-native import statements in the JavaScript code.)
<body>
<h1>Leader Election Demo</h1>
<p>Client Id: <span id="client-id"></span></p>
<p>Leader?: <span id="leader"></span></p>
<p>Current Counter Value: <span id="counter-value"></span></p>
<div id="presence-members">
Waiting to load presence....
</div>
</body>
</html>
The UI of this demo is very simple: a couple of paragraphs that display the current browser's clientId and whether or not this client is the leader.
It also displays a current counter value, and a list of all the clients in the swarm.
script.ts
All of the application logic is in script.ts.
We start off by importing dependencies and using document.getElementById to get a reference to the client-id, counter-value, leader and presence-members elements.
import { Types } from "ably";
import Swarm from "./Swarm";
let counter = 0;
const counterUi = document.getElementById("counter-value");
const clientIdUi = document.getElementById("client-id");
const leaderUi = document.getElementById("leader");
const presenceUi = document.getElementById("presence-members");
There are two interesting things to note here:
- we're using import to import both the Ably SDK types and our ./Swarm class from earlier. Vite will ensure this works just fine.
- we're initializing a variable called counter to 0. This is the application state.
We create a Swarm instance for some-channel-name, and initialize the UI with the current clientId and a "false" value for leaderUi:
const swarm = new Swarm("some-channel-name");
clientIdUi.innerText = swarm.id;
leaderUi.innerText = "false";
Now that the UI is set up, we need to attach some functions to our Swarm callbacks.
We assign a callback function to onSwarmPresenceChanged. When users join or leave the channel, it will update the UI with a stringified version of the members array.
swarm.onSwarmPresenceChanged = (members: Types.PresenceMessage[]) => {
  presenceUi.innerText = JSON.stringify(members);
};
We also assign a callback function to onElection. This will update the UI to show that the elected client is the leader.
In addition, when a client is elected leader, an interval is created that increments the counter variable every five seconds. As the counter is incremented, a message is published to all the other peers in the swarm with the updated counter value.
swarm.onElection = (channel: Types.RealtimeChannelPromise) => {
  leaderUi.innerText = "true";
  setInterval(() => {
    counter++;
    channel.publish("message", { counter: counter });
    console.log("Sent counter", counter);
  }, 5000);
};
To ensure that the counter value is updated in the UI, we provide an onConnection callback that subscribes to messages called "message".
When a message arrives, we first update counter with the value from the message body, and then we set the innerText of the counterUi to the updated value.
swarm.onConnection = (channel: Types.RealtimeChannelPromise) => {
  channel.subscribe("message", (message: Types.Message) => {
    counter = message.data.counter;
    counterUi.innerText = counter.toString();
  });
};
Finally we connect to the swarm:
swarm.connect();
export { }; // required for vite to import the file
It is worth noting that the onConnection callback executes for both the leader and the follower clients, so the leader also receives the messages it publishes. If you're observant, you’ll have noticed that this means the counter value is set twice per tick on the leader (once by the interval and once by the incoming message) and once on the followers. For this example, that doesn't matter.
Try it out
When you run this application in your browser, and the first client joins, it will be elected the leader and start incrementing the counter.
Subsequent clients that join receive the counter messages and update their UI. If the leader closes their browser, the election process will run and a new leader is elected to take over updating and distributing the counter messages from where the last leader left off.
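The handover works because every follower keeps its own copy of the counter up to date from the published messages. The sketch below models that in memory, with no Ably calls. The Peer interface and the currentLeader and tick functions are hypothetical, and it simplifies the election to "lowest id leads", which matches the demo as long as no lower-ranked client joins while a leader is already in place.

```typescript
// Hypothetical in-memory model of the demo: it only shows how state is handed over.
interface Peer {
  id: string;
  counter: number;
}

// The peer whose id sorts first leads, mirroring the ranking rule used by Swarm.
// Assumes at least one peer is present.
function currentLeader(peers: Peer[]): Peer {
  return peers.reduce((lowest, p) => (p.id < lowest.id ? p : lowest));
}

// One interval tick: the leader increments its counter and "publishes" it to everyone.
function tick(peers: Peer[]): void {
  const leader = currentLeader(peers);
  leader.counter++;
  for (const peer of peers) {
    peer.counter = leader.counter;
  }
}

let peers: Peer[] = [
  { id: "100", counter: 0 },
  { id: "200", counter: 0 },
  { id: "300", counter: 0 },
];

tick(peers);
tick(peers); // every peer now holds counter = 2

// The leader closes its browser; the remaining peers re-run the election.
peers = peers.filter((p) => p.id !== "100");
tick(peers); // "200" now leads and continues from 2, so the counter becomes 3

console.log(currentLeader(peers).id, peers.map((p) => p.counter));
```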
Run the sample application
You can clone the repository here to run the application. In order to run this demo you will need node and npm.
npm install
npm run start
Then open the browser to http://localhost:8000.
You'll need to edit Swarm.ts to provide an Ably API Key of your own for the example application to work.