Interview Preparation: Designing A Scalable Notification Service (Beginners)

How To Scale A Notification Service

Oct 30, 2021

Hello there 👋 ! We are growing fast and are now at 500+ subscribers 💪 💪. Thank you for your love and support ❤️ ❤️. I am enjoying it, I hope you are enjoying it too!

As a part of this newsletter, I promised to curate a list of questions that will help you prepare for interviews, apart from focussing on fundamental ideas. I will keep them under the “Interview Preparation” tag and share these regularly. In the previous blog, we discussed the design of a scalable Search Engine. Here is the next one, on designing a scalable notification service. Looking forward to hearing more ideas about what you would want to read/learn about in the comments. Also, in case there are improvements that you would want to suggest, I will be happy to incorporate those.

Introduction

Let us first start with a definition of a notification service. A “notification service” is a general-purpose service that is used to send notifications to its users. There are several use cases of notification service. The primary use case tends to be setting up an alert based on a specific event in the system. Let us take a look at a few concrete use cases.

Use Cases

E-Commerce:
- Notifying the user when an item becomes available.
- Notifying the user when an item’s price increases or decreases significantly.
- Notifying the user when an item has a discount running on it.
Cab-rides:
- Notifying the user when a cab ride is booked or canceled.
- Notifying the user when a credit card transaction goes through or is declined.
Software development:
- Notifying the user when a service is down. Typically used by APMs (such as AppDynamics, NewRelic).

Problem Statement

For the scope of this blog, let us take the example of an application like Uber, where the user is updated when a cab ride gets booked. Here is the high-level view of the data flow of the application:

The workflow is as follows:

The user sends a request to book a cab using his or her application.
The API then hits the Ride Reservation Service. The Service then reserves a cab from the list of available cabs and sends back a confirmation to the app.
The application relays the confirmation back to the user.

If the cab driver cancels the cab, then the following workflow is triggered.

A cancellation request is made to the Ride Reservation Service.
The Ride Reservation Service relays the notification to the Notification Service.
The Notification Service then relays triggers a cancellation to the Uber App.
The Uber App then relays the notification back to the user.

Functional Requirements

The “notification service” should send an email or a push notification to users if there is a cancellation by the cab driver or the rider.
The “notification” should also contain some metadata, i.e., the reason for cancellation and any additional charges levied.

The “notification service” should be agnostic of the client, i.e., it should be able to send the notification whether the client is a Web App, Mobile App, or an API end-point.

Non Functional Requirements

The “notification service” should be independently scalable, deployable, durable, available, secure, and maintainable. Let us look at each in detail:
- Scalable: Should support sending tens of million messages per hour.
- Available: Should survive hardware/network failures. It should not have a single point of failure.
- Performant: Should keep end-to-end latency as low as possible. The message should get delivered with five seconds to the user.
- Durable: Should not lose the message, and the message should be delivered at least once to the user.
- Maintainable: It should be easy to debug and look at metrics such as how many notifications were delivered, maximum latency of delivery, the maximum load on the service, etc.
- Deployable: Upgrading the notification service should not disrupt any other service.
- Secure: Should not send updates of one rider to a rider. The same holds for a driver as well.

Basic Components

Before we get into the High-Level Architecture design, let us discuss some of the fundamental ideas required to understand the system.

Publisher-Subscriber Model

In software design, publish-subscribe is a messaging pattern where senders, called publishers, do not program the messages to be sent directly to specific receivers called subscribers. Instead, they categorize the messages into groups (topics) into intermediate messaging subsystems (e.g., a messaging queue). The messaging queue is agnostic of the semantics of the messages. Examples of messaging queues are Rabbit MQ, Apache Kafka, etc.

High-Level Architecture

We will now discuss the components of the design in detail.

Load Balancer

A load balancer sits in front of the notification handler service to balance the load on the downstream notification service and handle routing in case of downstream failures.

Notification Handler Service

The Notification Handler Service is a lightweight web service. It is stateless and can be deployed across several data centers. Some of its functions are:

Request validation
Authentication and authorization
SSL Termination
Caching
Rate Limiting & Throttling
Request Dispatching
Metric Collection
Encryption of responses

The frontend service host consists of the following sub-components:

Reverse Proxy: The reverse proxy handles the incoming request, terminates the HTTPS connection, and forwards them to the Notification Web-Server. Some of the popular reverse proxy servers are NGINX and HAProxy.
Web Service: The Web Service encodes the logic to handle the incoming request, fetch the topic for the message from the Metadata Service and then pass it on the temporary storage. It also performs request validation, auth checks to ensure that the request has the necessary privileges.
Local Cache: The Web Service may also keep a local cache so that it does not have to visit the Metadata Service for every call to get the topic of the message. It can be an in-memory cache and built using Guava or Memcached (as a separate process).

Service Logs Agent: The webservice also generates many logs. These logs consist of exceptions, warnings, metrics, and audit trails. To capture these logs, usually, we have an agent injected into the service. These agents collect logs and send them to a central location such as Splunk and ELK. The log search platforms make it easy for developers to find issues and debug them.
Metrics Agent: A set of metrics are injected into them as well to monitor services. Some of these metrics include the following:
1. The number of notification requests to the webserver
2. The throughput of notification requests
3. The number of valid/invalid notification requests
4. Latency distribution of the notification requests.

A metric agent will capture these metrics and send them to an APM (e.g., AppDynamics, and NewRelic).

Audit Logs Agent: Similar to the metrics, logs are also collected to audit the system. These may be collected and stored in a Database using an Audit Log Agent.

Metadata Service

The Metadata Service stores a mapping between a topic and its corresponding owner. Primarily, it acts as a caching layer between the Notification WebServer and the Database.

System Requirements:

The cache should be able to support high reads, low writes.

Solution:

There are multiple ways to implement it. We can use any scalable cache such as Redis, Memcached, and DynamoDB like the Metadata Service.

Metadata Database

The Metadata Service is backed by a relational database that stores information corresponding to each topic ID. Some of the metadata that has to be stored with each topic is the list of subscribers for each topic, configurations such as rate-limit per topic, SLAs, etc., defined for each topic. The data has a relational structure to it, and therefore, a relational DB would suffice.

Temporary Storage

The most critical piece of the system is a temporary storage layer that sits between the Notification Service and the Subscribers.

System Requirements:

The temporary storage should support the durable storage of messages for a small period.
The temporary storage should be scalable as the number of messages to be read and written will be very high.

There are several options to consider. Let us discuss each of the following possibilities in detail.

Databases: We can go with either a SQL or a NoSQL database. System requirements:

Large scale.
Do not require ACID semantics or transactional semantics.
The database must be highly scalable for reads and writes, highly available, and tolerate network partitions.
Do not require highly complex analytical queries.
Do not require data to be stored for analysis or data warehousing.

A No-SQL database such as a key-value store or columnar data store is better. Examples of such databases include Cassandra or Dynamo DB.

In-memory caches:

Another option to store the intermediate messages is keeping them in in-memory caches such as Redis and Memcached. Performance of reads and writes is an advantage of in-memory caches. The disadvantage tends to be durability and correctness semantics. Disk-backed caches such as Redis do solve the problem of durability. However, the notification service will have to implement some of the correct features as message delivery guarantees, only once semantics, etc.

Messaging Queues:

The last option we will discuss is messaging queues. Some of the standard messaging queues are Apache Kafka, Amazon’s SQS, and Rabbit MQ. Typically, a messaging queue provides the following guarantees:

Permanent Storage
High Availability
High Throughput
Scalability

Apache Kafka:

Apache Kafka is designed on an abstraction of a distributed commit log. It is a distributed event stream processing platform and can handle trillions of events in a day. It supports the “Publish-Subscribe” model. The underlying design is an immutable commit log, from where publishers can write messages and the subscribers can and read messages from. Some good resources to go deeper into are Kafka’s original paper and blog about how LinkedIn uses Kafka for 7 trillion messages per day.

Compared to RabbitMQ, Kafka has a much more performant platform, has a more significant community around it, and is much more widely adopted.

In comparison to Amazon’s SQS, Kafka being cloud-agnostic, has a much higher value in being used.

Given that a message queue satisfies most of our requirements and plays well with our pub-sub model, it is good to go with the best message queue platform out there, i.e., Apache Kafka.

Message Sender Service

The primary role of the message sender service is to read messages from the shared message queue and send them to the downstream subscribers. Let us look at the design in more detail.

Message Retriever

The Message Retriever module gets messages from the message queue. Once the message is read from the message queue, it is passed on to the task creator to dispatch the message downstream to the list of subscribers that consume the message.

Apart from reading the messages, it also can throttle the number of messages being read so the Sender Service may not get overwhelmed by the sender service.

Metadata Service Client

The Metadata Service Client gets the metadata information that is associated with each message. We make a separate call instead of passing the metadata information along with the message itself. The rationale for this design is to reduce the size of the payload that the service has to pass around with the message.

Task Creator + Task Executor

Once the client receives the message and its metadata, the task creator will translate it into a task. The task will contain information about the message, a unique identifier, a list of all the subscribers, the total number of subscribers, timestamp, etc. The task is a wrapper defining the message. A thread pool such as the task executor is then used to send the message to subscribers. Java provides a pre-defined implementation of a ThreadPool. The push to subscribers may be using an email, SMS, IM (e.g., WhatsApp), API end-point, and a Phone Call.

Non-Functional Requirements

Is the system architecture scalable? ✅

Explanation: Each of the components of the architecture can independently scale. It includes Notification Handler, Metadata Service, Temporary Storage, and Message Sender Service. This architecture will scale horizontally.

Is the system architecture highly available? ✅

Explanation: Each of the services has redundancy built-in and can be replicated across data centers. None of the services expose a single-point-of-failure.

Is the system architecture performant? ✅

Explanation: Each of the services can scale horizontally. Besides, the caches (Metadata Service and Temporary Storage) are built keeping in mind performance. The Metadata Service is built on top of Redis, Memcached (which are performant services), and the Temporary Storage is built using Apache Kafka, which is highly scalable.

Is the system architecture performant? ✅

Explanation: The messages in the message queue are persisted on disk and are durable. Similarly, the Metadata Service built on Redis is durable and backed by a Database, making data loss very low.

Challenges To Consider

We have designed a functional, scalable, available, and performant system. There remain several practical challenges that we will need to consider. We will consider these in a follow-up blog.

Thresholding Spammers
Duplicate messages
Message Delivery
Monitoring (Alerting)
Message Ordering
Security

We will discuss these challenges in a follow-up blog, an Advanced Form Essay On Notification Service Design. Furthermore, we will also discuss the design of a scalable message queue (e.g., Apache Kafka).

References

Cheers,
Ravi.

Satoshi Kimura

Aug 12, 2023

Hi Ravi, thanks for the informative post. I had a question related to using Kafka. Where in the high level design does the Kafka component fit in? Is it between the temporary storage and the sender?

Expand full comment

1 reply by Ravi Tandon

Shreyas Shubham

May 3, 2022

Really informative blog. The one thing I like about your blogs is, it's easy to grasp for beginners like me.

2 more comments...

Ravi's System Design Newsletter

Discussion about this post