
Rate limiting in Distributed Systems

In the previous article, System Design Concepts: Rate Limiting, we saw how rate limiting can be implemented with different algorithms to cap the number of requests received by a single-server system.

In the case of multiple servers distributed across different regions around the world, implementing an independent rate limiter on each server causes two main problems: inconsistency and race conditions.

In this article, we will explore these two main problems and how we could implement a better strategy for remediating these problems for a distributed system.

Inconsistency

By definition, inconsistency in a computing system occurs when a set of instructions, rules, or processes produce results that disagree with one another across iterations. Inconsistency is defined differently in different areas of computer science.

Image by Author

Inconsistency in a rate limiter arises when a large number of requests from many clients reach the server side within a small time frame. Even with the algorithms mentioned in System Design Concepts: Rate Limiting, if multiple servers each run their own rate limiter, the system as a whole can admit more requests than the intended global limit allows.
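To make this concrete, here is a minimal sketch (with hypothetical class and variable names) of why independent per-server limiters break the global limit: each server honestly enforces the limit on its own traffic, yet the total admitted across servers is a multiple of what was intended.

```python
# Three servers each enforce the "global" limit locally, so together
# they admit 3x the intended number of requests.
GLOBAL_LIMIT = 100  # intended global requests per window


class PerServerLimiter:
    """A fixed-window counter that only sees its own server's traffic."""

    def __init__(self, limit):
        self.limit = limit
        self.used = 0

    def allow(self):
        if self.used < self.limit:
            self.used += 1
            return True
        return False


servers = [PerServerLimiter(GLOBAL_LIMIT) for _ in range(3)]

# 300 requests spread round-robin across the 3 servers: every server
# stays within its local limit, but globally all 300 are admitted.
admitted = sum(servers[i % 3].allow() for i in range(300))
print(admitted)  # 300 -- three times the intended global limit of 100
```

Each limiter is locally correct; the inconsistency is purely a consequence of no server seeing the global picture.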

Race Conditions

By definition, a race condition in a computing system occurs when two or more operations that must be performed sequentially are instead executed concurrently, so the outcome depends on their timing.

Race conditions in a rate limiter arise when multiple requests from the client side try to access the rate limiter on the server side and increment its counter. Incrementing the counter is typically a two-step process, and the two steps must execute sequentially.

First, the request reads the current counter value. Second, it increments that value and writes it back to the rate limiter's counter.

If other requests write to the counter between one request's read and write, increments are lost and subsequent requests read an incorrect value.
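The lost-update problem above can be sketched with two in-process counters (class names here are illustrative, not from the article): one that performs the read-then-write in two separate steps, and one that guards the read-modify-write with a lock so it behaves atomically. In a real distributed deployment the same effect is usually achieved with an atomic store operation such as Redis's INCR rather than an in-process lock.

```python
import threading


class UnsafeRateLimiter:
    """Counter incremented with a non-atomic read-then-write (race-prone)."""

    def __init__(self):
        self.count = 0

    def record_request(self):
        current = self.count       # step 1: read
        # another thread may read the same stale value here
        self.count = current + 1   # step 2: write (can lose increments)


class SafeRateLimiter:
    """Same counter, but the read-modify-write is guarded by a lock."""

    def __init__(self):
        self.count = 0
        self._lock = threading.Lock()

    def record_request(self):
        with self._lock:           # read and write happen atomically
            self.count += 1


def hammer(limiter, threads=8, per_thread=10_000):
    """Fire concurrent increments at a limiter and return the final count."""
    workers = [
        threading.Thread(
            target=lambda: [limiter.record_request() for _ in range(per_thread)]
        )
        for _ in range(threads)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return limiter.count


safe_total = hammer(SafeRateLimiter())
print(safe_total)  # always 80000; the unsafe version can lose increments
```

The locked version always counts every request; the unsafe version may undercount under contention, which is exactly the race described above.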

Distributed in-memory rate limiters

Rate limiters that we have seen so far are all present on the server-side. Can we think of a way to distribute it across the different clients such that the overall global rate is still controlled by the server?

Yes, we can!

This is where we introduce distributed in-memory rate limiters. The idea is to take the total rate limit of the server-side rate limiter and distribute it equally across all the client nodes. Each client node then enforces its own partial limit on its outgoing requests, and the sum of the partial limits equals the total limit.
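The partial-limit idea can be sketched as a small client-side limiter (the class and parameter names below are assumptions for illustration): each of the `num_clients` nodes enforces `total_limit / num_clients` requests per window on its own traffic, so the aggregate outgoing rate never exceeds the total limit.

```python
import time


class LocalRateLimiter:
    """Client-side limiter that enforces one node's share of a global limit.

    A fixed-window sketch: with num_clients nodes, each node allows at
    most total_limit / num_clients requests per window, so the sum
    across all nodes stays within total_limit.
    """

    def __init__(self, total_limit, num_clients, window_seconds=1.0):
        self.partial_limit = total_limit // num_clients
        self.window = window_seconds
        self.window_start = time.monotonic()
        self.used = 0

    def allow(self):
        now = time.monotonic()
        if now - self.window_start >= self.window:
            self.window_start = now  # start a fresh window
            self.used = 0
        if self.used < self.partial_limit:
            self.used += 1
            return True
        return False


# 4 clients sharing a global limit of 100 req/s -> 25 req/s each.
limiter = LocalRateLimiter(total_limit=100, num_clients=4)
allowed = sum(limiter.allow() for _ in range(40))
print(allowed)  # 25: requests beyond this node's partial limit are rejected
```

Any of the algorithms from the previous article (token bucket, sliding window, and so on) could replace the fixed window here; the key point is only that each node enforces its share locally.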


A system diagram for an in-memory distributed rate limiter would look as follows:

Image Sourced from LINE Engineering

However, in real life, distributed systems are rarely designed with a permanently fixed number of nodes, nor are they immune to failures. If one of the client nodes fails, or if the system is scaled by adding nodes, the partial limits have to be redistributed after the failure or scaling event.

To take care of this issue we introduce another entity into the system, known as a configuration server. The configuration server sits alongside the client nodes and governs how the rate limit is distributed among them. It assigns each client node its partial limit and holds the global rate limit, which the operator can change without a system restart.
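A minimal sketch of such a configuration server might look like the following (the class, its methods, and the node names are hypothetical, not from any real system): it owns the global limit and recomputes every client's partial limit whenever a node joins, a node leaves, or the operator changes the global limit at runtime.

```python
class ConfigServer:
    """Hypothetical configuration server for distributed rate limiting.

    Owns the global limit and recomputes each registered client's
    partial limit on membership changes or operator updates.
    """

    def __init__(self, global_limit):
        self.global_limit = global_limit
        self.clients = set()

    def register(self, client_id):
        """A new client node joins; everyone's share shrinks."""
        self.clients.add(client_id)
        return self.partial_limits()

    def deregister(self, client_id):
        """A client node fails or leaves; its share is redistributed."""
        self.clients.discard(client_id)
        return self.partial_limits()

    def set_global_limit(self, new_limit):
        """Operator changes the global limit at runtime (no restart)."""
        self.global_limit = new_limit
        return self.partial_limits()

    def partial_limits(self):
        if not self.clients:
            return {}
        share = self.global_limit // len(self.clients)
        return {client: share for client in self.clients}


server = ConfigServer(global_limit=120)
server.register("node-a")
server.register("node-b")
limits = server.register("node-c")
print(limits)            # each of the 3 nodes gets 40 (key order may vary)
server.deregister("node-b")
print(server.partial_limits())  # each remaining node now gets 60
```

In practice the clients would receive their new partial limits by polling the configuration server or via push notifications, and the server itself would be replicated so it is not a single point of failure.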

Image Sourced from LINE Engineering

Conclusion

In this article we explored how rate limiting can be implemented in a distributed system, expanding from a single-server system. We saw that rate limiting algorithms alone don't suffice to control rate limits across different client nodes, leading to problems like inconsistency and race conditions. To mitigate these, we saw an approach that distributes the total rate limit as partial limits across the different client nodes.

Finally, to avoid discrepancies in the distribution of the limits across different nodes in the event of a failure or scaling operation, we introduced a configuration server that handles the total rate limit and the number of client nodes in the system.


Ajin Sunny

Software Engineer @MindPetal Inc. Previously SWE at @Outco Inc.