This short post by a member of Ably's engineering team describes how we resolved a problem that is typical of the challenges we face each week. We thrive on solving hard distributed system problems that are mostly platform agnostic and theoretical in nature, and this is the first post in a long-term series of articles about things we've learned recently.
How we use Redis at Ably
Ably is a platform for pub/sub messaging. Publishes are made on named channels, and clients subscribed to a given channel have all messages on that channel delivered to them. We use Redis, a distributed in-memory database for key-based storage, to store various entities such as authentication tokens and ephemeral channel state. It’s a good fit for temporary storage of messages while we process them.
We have billions of active Redis keys at any given time, which are sharded across numerous Redis instances. The sharding strategy places related keys in the same shard so that we can perform operations that update related keys atomically. We use Redis Lua scripts extensively to query and update keys, and rely on the atomicity of script execution to preserve the integrity of values of related keys. That is, Redis runs a script to completion without executing any other command in between, so no other client can observe or modify a partially updated set of keys.
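As a rough illustration of how related keys can be kept on one shard, the sketch below hashes only a prefix of each key. The key layout (`channel:<id>:<field>`), the shard count, and the function names are assumptions made for this example; they are not a description of Ably's actual scheme.

```python
import zlib

NUM_SHARDS = 16  # hypothetical shard count, chosen only for the example


def shard_key(key: str) -> str:
    """Return the part of the key that decides the shard.

    Assumes keys look like "<entity>:<id>:<field>", so that every key
    belonging to the same entity shares the "<entity>:<id>" prefix.
    """
    return ":".join(key.split(":")[:2])


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a key to a shard index using a stable CRC32 of its prefix."""
    return zlib.crc32(shard_key(key).encode("utf-8")) % num_shards


# Both keys belong to channel "abc", so they land on the same shard and
# can be updated together by a single script running on that instance.
same_shard = shard_for("channel:abc:presence") == shard_for("channel:abc:history")
print(same_shard)  # True
```

Hashing a shared prefix rather than the whole key is what makes multi-key scripts possible here: a script only ever touches keys that live on the instance it runs on.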
We also use expiring keys extensively; the nature of the Ably service is that much of the state of a channel is ephemeral and only retained for a limited period of time (typically 2 minutes). We set keys to have a TTL so they auto-expire.
The issue
The integrity of a set of related keys requires that either all keys exist, or none exist. We had assumed that the atomic nature of script execution would also apply to expire operations invoked by a script. In fact, naively expiring multiple keys in the same script does not preserve that integrity.
Although expire operations execute atomically within the same script (with no opportunity for intervening operations to occur), the timestamps associated with each expire operation are not necessarily the same, because each relative expiry is resolved against the clock at the moment its command runs.
Running TIME twice within the same script returns two different values:
-- time.lua
local a = redis.call('time')
local b = redis.call('time')
return {a, b}
$ ./redis-cli --eval /app/time.lua
1) 1) "1638280442"
   2) "996960"
2) 1) "1638280442"
   2) "996966"
Checking the actual expiry time:
-- expire_check.lua
redis.call('set', 'foo', '1')
redis.call('expire', 'foo', 1)
-- slow calls...
redis.call('set', 'bar', '2')
redis.call('expire', 'bar', 1)
local fooExpiry = redis.call('PEXPIRETIME', 'foo')
local barExpiry = redis.call('PEXPIRETIME', 'bar')
return {fooExpiry, barExpiry}
$ ./redis-cli --eval /app/expire_check.lua
1) (integer) 1638280843717
2) (integer) 1638280843730
Here the two keys, both given a TTL of 1 second, end up with expiry times 13 milliseconds apart. Expiry is also not pinpoint accurate in its own right: it can be between 0 and 1 millisecond out.
The implication is that there can be moments at which some keys have expired but other related keys have not, which could lead to an inconsistent state.
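The window can be reproduced without Redis using a toy model. The sketch below is an assumption-laden stand-in, not Redis itself: `ToyStore`, its manually advanced millisecond clock, and the 13 ms delay are all invented for the illustration. It only models the one behaviour that matters here, namely that a relative TTL is turned into a deadline using the clock value at the time of the call.

```python
from typing import Dict


class ToyStore:
    """Toy key store with a manually advanced millisecond clock (not Redis)."""

    def __init__(self, now_ms: int) -> None:
        self.now_ms = now_ms
        self.deadline_ms: Dict[str, int] = {}

    def advance(self, ms: int) -> None:
        """Move the clock forward by the given number of milliseconds."""
        self.now_ms += ms

    def set_with_ttl(self, key: str, ttl_ms: int) -> None:
        """Store a key whose deadline is relative to the clock right now."""
        self.deadline_ms[key] = self.now_ms + ttl_ms

    def exists(self, key: str) -> bool:
        """A key exists until the clock reaches its deadline."""
        deadline = self.deadline_ms.get(key)
        return deadline is not None and self.now_ms < deadline


def run_script(store: ToyStore, delay_ms: int) -> None:
    """Mimic a script that sets two related keys with the same relative TTL."""
    store.set_with_ttl("foo", 1000)
    store.advance(delay_ms)  # stands in for the slow calls between the two sets
    store.set_with_ttl("bar", 1000)


store = ToyStore(now_ms=0)
run_script(store, delay_ms=13)
store.advance(990)  # clock is now 1003: past foo's deadline, before bar's
snapshot = (store.exists("foo"), store.exists("bar"))
print(snapshot)  # (False, True)
```

For the 13 ms between the two deadlines, a reader sees `bar` without `foo`, which is exactly the "some keys but not others" state that the integrity rule forbids.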
Our solution
The solution is to use EXPIREAT to set an absolute expiry time for all related keys, rather than rely on a relative expiry time through the TTL.
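The essential step is that the deadline is computed once and then reused for every key. A minimal sketch of that idea, which only builds the list of commands rather than talking to Redis: `plan_expireat` and the sample timestamp are hypothetical names and values chosen for the example.

```python
from typing import List, Tuple

Command = Tuple[str, str, int]


def plan_expireat(now_s: int, ttl_s: int, keys: List[str]) -> List[Command]:
    """Build one EXPIREAT command per key, all sharing a single deadline.

    now_s is a single clock reading in Unix seconds, taken once up front.
    """
    deadline = now_s + ttl_s  # computed once, never re-read from the clock
    return [("EXPIREAT", key, deadline) for key in keys]


commands = plan_expireat(now_s=1_700_000_000, ttl_s=1, keys=["foo", "bar"])
for command in commands:
    print(*command)
# EXPIREAT foo 1700000001
# EXPIREAT bar 1700000001
```

However long the work between the individual commands takes, every key carries the same absolute deadline, because the clock is read only once.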
The Redis documentation does not make clear whether keys with the same EXPIREAT setting are guaranteed to expire at the same instant. To be cautious, we also reordered key expiry so that, either way, we avoid inconsistency.
-- expire_new.lua
-- Unix time
local now = redis.call('time')[1]
local expiry = now + 1
redis.call('set', 'foo', '1')
redis.call('expireat', 'foo', expiry)
-- slow calls...
redis.call('set', 'bar', '2')
redis.call('expireat', 'bar', expiry)
local fooExpiry = redis.call('PEXPIRETIME', 'foo')
local barExpiry = redis.call('PEXPIRETIME', 'bar')
return {now, fooExpiry, barExpiry}
$ ./redis-cli --eval /app/expire_new.lua
2) (integer) 1638281266000
3) (integer) 1638281266000
This is typical of the many engineering problems we troubleshoot and solve each week here at Ably.
Fancy working with us in the realtime sphere? Our engineers have a range of broad technology skills across infrastructure, security, distributed systems, and beyond.
You can find us on Twitter or LinkedIn, and apply to join us in one of our open roles.