Understanding Scalability in System Design

Modern systems rarely fail because they are badly written.

More often, they fail because they cannot handle growth.

A system that works perfectly with 1,000 users may collapse completely when the user base reaches 1 million. This is why scalability is one of the most important topics in system design.

In this article, we'll understand what scalability means, how engineers measure system load, and the different ways systems grow to handle increasing demand.

What is Scalability?

In simple terms, scalability is the ability of a system to handle growth.

Growth can happen in different ways:

More users are joining the system.
More requests are being sent to the system.
More data is being stored.
Higher traffic during peak hours.

A scalable system should be able to continue performing well even when demand increases.

For example:

Imagine an e-commerce platform during a festive sale.

If the number of users suddenly increases from 10,000 to 1 million, the system should still:

process orders
update inventory
show product pages quickly

If it cannot handle this increase, the system is not scalable.

Understanding System Load

Before we talk about scaling, we must first understand what kind of load the system is handling.

Different systems measure load in different ways.

Some common load parameters include:

Requests per second
Number of active users
Read vs write operations
Cache hit rate
Amount of stored data

These metrics help engineers understand where the system is under pressure.

For example:

A video streaming platform may focus on bandwidth and concurrent users
A messaging app may care about messages per second
A social network may track timeline requests per second

Understanding the correct metric is the first step toward designing scalable systems.
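As a rough illustration, a couple of these load parameters can be derived from a request log. The sketch below assumes a hypothetical log of (timestamp, HTTP method) pairs and treats POST, PUT, PATCH, and DELETE as writes; the names `REQUEST_LOG` and `load_summary` are made up for this example.

```python
from collections import Counter

# Hypothetical request log: (seconds since window start, HTTP method).
REQUEST_LOG = [
    (0.1, "GET"), (0.4, "GET"), (0.9, "POST"), (1.2, "GET"),
    (1.7, "GET"), (2.5, "POST"), (3.1, "GET"), (3.8, "GET"),
]
# Assumption: these methods count as writes, everything else as a read.
WRITE_METHODS = {"POST", "PUT", "PATCH", "DELETE"}


def load_summary(events, window_seconds):
    """Summarize a request log into a few basic load parameters."""
    kinds = Counter(
        "write" if method in WRITE_METHODS else "read" for _, method in events
    )
    reads, writes = kinds["read"], kinds["write"]
    return {
        "requests_per_second": len(events) / window_seconds,
        "reads": reads,
        "writes": writes,
        # Reads per write; infinite when the window contains no writes.
        "read_write_ratio": reads / writes if writes else float("inf"),
    }


summary = load_summary(REQUEST_LOG, window_seconds=4)
print(summary)
```

For this sample the system sees 2 requests per second with 3 reads for every write, which already hints that read performance matters more than write performance here.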

Real-World Example: Twitter (Now X) Timeline Problem

One of the most famous scalability challenges comes from social media platforms like Twitter.

Two common operations happen on such systems:

Posting a tweet
Viewing a user's timeline

At first glance, posting tweets seems simple. But the real challenge lies in distributing that tweet to millions of followers.

Let’s look at two different ways to design the timeline system.

Approach 1: Compute Timeline When User Reads

In this design, the system calculates the timeline only when the user opens it.

Steps:

Find all users the current user follows
Fetch their recent tweets
Merge and sort them

Example query:

SELECT tweets.*, users.*
FROM tweets
JOIN users ON tweets.sender_id = users.id
JOIN follows ON follows.followee_id = users.id
WHERE follows.follower_id = :current_user_id;
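The same read-time steps can be sketched in application code. This is a minimal in-memory illustration, not how any real platform implements it; the `Tweet` fields, the `follows` and `tweets_by_user` dictionaries, and the `limit` parameter are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tweet:
    sender_id: int
    text: str
    timestamp: int  # hypothetical field: seconds since some epoch


def timeline_on_read(user_id, follows, tweets_by_user, limit=20):
    """Build a timeline at read time: find followees, fetch, merge, sort."""
    followees = follows.get(user_id, set())
    candidates = [
        tweet
        for followee in followees
        for tweet in tweets_by_user.get(followee, [])
    ]
    # Newest tweets first.
    candidates.sort(key=lambda tweet: tweet.timestamp, reverse=True)
    return candidates[:limit]


# Hypothetical data: user 1 follows users 2 and 3.
follows = {1: {2, 3}}
tweets_by_user = {
    2: [Tweet(2, "hello", 100), Tweet(2, "again", 300)],
    3: [Tweet(3, "hi", 200)],
}
timeline = timeline_on_read(1, follows, tweets_by_user)
print([tweet.text for tweet in timeline])
```

Every call repeats the lookup, merge, and sort, so the cost of this design grows with the number of timeline reads.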

Advantage

Writes are cheap because the system only stores tweets once.

Problem

Reads become very expensive.

If millions of users open their timelines at the same time, the system must run millions of complex queries.

This approach struggles when the read traffic is extremely high.

Approach 2: Precompute Timeline When Tweet is Posted

Another design flips the logic.

Instead of computing timelines when users read them, the system prepares the timeline when a tweet is created.

Steps:

User posts a tweet
The system copies that tweet into the timeline of each follower

Now when a user opens their timeline, the data is already prepared.
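A minimal in-memory sketch of this write-time approach follows. The `followers` data, the `TIMELINE_LIMIT` cap, and the function names are assumptions for illustration; a real system would keep timelines in a cache or database rather than a Python dictionary.

```python
from collections import defaultdict, deque

TIMELINE_LIMIT = 800  # assumed cap on entries kept per timeline

# followers[author_id] = set of user ids following that author (hypothetical)
followers = {2: {1, 4}, 3: {1}}
timelines = defaultdict(lambda: deque(maxlen=TIMELINE_LIMIT))


def post_tweet(sender_id, text):
    """Copy the tweet into every follower's timeline; return the write count."""
    writes = 0
    for follower_id in followers.get(sender_id, set()):
        timelines[follower_id].appendleft((sender_id, text))  # newest first
        writes += 1
    return writes


def read_timeline(user_id):
    """Reading is a plain lookup because the work was done at write time."""
    return list(timelines.get(user_id, ()))


writes_for_first = post_tweet(2, "hello")
writes_for_second = post_tweet(3, "hi")
print(writes_for_first, writes_for_second, read_timeline(1))
```

Note that `post_tweet` performs one write per follower, which is exactly where the cost of this design shows up.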

Advantage

Reading timelines becomes extremely fast.

Problem

Writes become expensive.

Imagine:

The average user has 75 followers
4,000 tweets are posted per second

The system now performs:

4,000 × 75 = 300,000 writes per second

For celebrities with millions of followers, a single tweet could generate millions of database writes.
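The arithmetic above generalizes to a one-line formula: timeline writes per second equal the tweet rate multiplied by the average follower count. The snippet below is only a back-of-the-envelope helper; the function name and the 30 million follower figure are hypothetical.

```python
def fanout_writes_per_second(tweets_per_second, avg_followers):
    """Timeline writes per second when each tweet is copied to every follower."""
    return tweets_per_second * avg_followers


average_case = fanout_writes_per_second(4_000, 75)
print(f"{average_case:,} timeline writes per second")

# One tweet from an account with a hypothetical 30 million followers
# costs one timeline write per follower.
celebrity_followers = 30_000_000
print(f"{celebrity_followers:,} timeline writes for a single tweet")
```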

Hybrid Design Used in Practice

Real systems rarely use just one approach.

Instead, they combine both.

Typical strategy:

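Normal users → Fan-out on write (Approach 2: copy each tweet into follower timelines)
Celebrities → Compute on read (Approach 1: fetch their tweets when a timeline is opened)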

This reduces the load caused by huge follower counts.
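One way such a hybrid could look is sketched below. It is an in-memory illustration under stated assumptions: the `CELEBRITY_THRESHOLD` cutoff, the sample `followers` data, and the use of a counter instead of real timestamps are all invented for the example.

```python
from collections import defaultdict
from itertools import count

CELEBRITY_THRESHOLD = 3  # hypothetical cutoff, tiny so the demo stays readable

# followers[author] = set of users who follow that author (hypothetical data)
followers = {
    "alice": {"carol"},
    "star": {"carol", "dave", "erin"},
}
# following[reader] = set of authors that reader follows, derived from followers
following = defaultdict(set)
for author, readers in followers.items():
    for reader in readers:
        following[reader].add(author)

timelines = defaultdict(list)         # precomputed entries, one list per reader
celebrity_tweets = defaultdict(list)  # stored once per celebrity author
sequence = count(1)                   # stand-in for a timestamp


def is_celebrity(author):
    return len(followers.get(author, ())) >= CELEBRITY_THRESHOLD


def post_tweet(author, text):
    entry = (next(sequence), author, text)
    if is_celebrity(author):
        # Compute on read: store once, merge later.
        celebrity_tweets[author].append(entry)
    else:
        # Fan-out on write: copy into each follower's timeline.
        for reader in followers.get(author, ()):
            timelines[reader].append(entry)


def read_timeline(reader):
    merged = list(timelines.get(reader, []))
    for author in following.get(reader, ()):
        if is_celebrity(author):
            merged.extend(celebrity_tweets.get(author, []))
    return sorted(merged, reverse=True)  # newest first


post_tweet("alice", "lunch")
post_tweet("star", "new album")
carol_timeline = read_timeline("carol")
print(carol_timeline)
```

The celebrity tweet is written once no matter how many followers the account has, while ordinary tweets are still served from precomputed timelines.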

The key takeaway here is:

Scalability solutions depend heavily on usage patterns.

There is no universal design that works for every system.

Measuring System Performance

When traffic increases, engineers usually ask two questions.

Question 1

If load increases but resources stay the same,
how does system performance change?

Question 2

If load increases,
how many additional resources are needed?

To answer these questions, we measure system performance using two main metrics.

Throughput

Throughput measures how much work the system can process.

Example:

records processed per second
tasks completed per minute

Throughput is commonly used in batch processing systems like data pipelines.

Response Time

Response time measures how long a user waits for a response.

This includes:

processing time
network delay
waiting in queues

In most web systems, response time is the most important user-facing metric.

Latency vs Response Time

People often mix these terms, but they are slightly different.

Latency

The time a request waits before processing starts.

Response Time

Total time from request to response.

Response Time = Latency + Processing Time + Network Delay

Users care about response time, because that represents how long they actually wait.

Why Average Response Time is Misleading

Many engineers make the mistake of looking only at the average response time.

But averages hide slow requests.

Example:

If most requests take 100 ms but a few take 5 seconds, the average may still look fine.

However, those slow requests create a bad user experience.

This is why engineers rely on percentiles.
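A small worked example makes the problem concrete. The sample below is invented: 98 requests at 100 ms and 2 requests at 5 seconds.

```python
import statistics

# Hypothetical sample: 98 fast requests and 2 very slow ones (milliseconds).
response_times_ms = [100] * 98 + [5000] * 2

mean_ms = statistics.mean(response_times_ms)
median_ms = statistics.median(response_times_ms)
slow_requests = sum(1 for t in response_times_ms if t >= 1000)

print(f"mean={mean_ms} ms, median={median_ms} ms, slow requests={slow_requests}")
```

The mean comes out at 198 ms, which looks healthy, yet two users in this sample waited a full 5 seconds. Neither the mean nor the median reveals that.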

Understanding Percentiles

A percentile is the response time that a given share of requests stay under. Higher percentiles show how slow the worst requests are.

Common metrics include:

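Percentile	Meaning
50th (p50)	Median: half of all requests are faster than this
95th (p95)	5% of requests are slower than this
99th (p99)	1% of requests are slower than this (rare, very slow cases)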

Large tech companies often monitor the 99th percentile response time to ensure even rare slow requests are under control.
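Percentiles are straightforward to compute from raw measurements. The sketch below uses the nearest-rank method, which is one of several common definitions (monitoring tools may interpolate or approximate instead); the sample data and the `percentile` helper are assumptions for illustration.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical sample of 1,000 response times in milliseconds.
response_times_ms = [100] * 940 + [400] * 45 + [5000] * 15

p50 = percentile(response_times_ms, 50)
p95 = percentile(response_times_ms, 95)
p99 = percentile(response_times_ms, 99)
print(f"p50={p50} ms, p95={p95} ms, p99={p99} ms")
```

In this sample the median is 100 ms, the 95th percentile is 400 ms, and the 99th percentile is 5,000 ms, so only the high percentiles expose the slow tail.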

The Tail Latency Problem

Modern systems often depend on multiple services.

Example:

A single request may involve:

authentication service
recommendation engine
database
payment service

If these calls run in parallel, the overall response must wait for the slowest one. If they run one after another, every delay adds up.

This problem is known as tail latency amplification.

Even if most services are fast, one slow component can delay the entire request.
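The effect can be estimated with a little probability. If each backend call is slow some small fraction of the time, the chance that at least one of n calls is slow is 1 - (1 - fraction)^n. The sketch below assumes the calls are independent and run in parallel, which is a simplification; the latency numbers and names are hypothetical.

```python
def chance_of_slow_request(backend_calls, slow_fraction=0.01):
    """Probability that at least one of the backend calls is slow,
    assuming each call is independently slow with probability slow_fraction."""
    if backend_calls < 0:
        raise ValueError("backend_calls must not be negative")
    return 1 - (1 - slow_fraction) ** backend_calls


# With parallel calls, the request is as slow as its slowest dependency.
latencies_ms = {
    "authentication": 20,
    "recommendations": 45,
    "database": 30,
    "payment": 900,  # hypothetical slow outlier
}
overall_ms = max(latencies_ms.values())
slowest_service = max(latencies_ms, key=latencies_ms.get)
print(f"overall={overall_ms} ms, held up by {slowest_service}")

for calls in (1, 4, 20, 100):
    print(f"{calls:>3} calls -> {chance_of_slow_request(calls):.1%} chance of a slow request")
```

With a 1% slow rate per call, a request that touches 4 services is slow about 4% of the time, and one that touches 100 services is slow about 63% of the time.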

Methods to Handle Growing Load

Once engineers understand the load, they decide how to scale the system.

There are two common approaches.

Vertical Scaling (Scale Up)

This means upgrading a machine with more resources.

Example:

more CPU
more RAM
faster disks

Advantages

Simple to implement.

Limitations

Machines cannot grow infinitely.
Eventually, hardware upgrades become extremely expensive.

Horizontal Scaling (Scale Out)

Instead of upgrading one machine, the system adds more machines.

The workload is distributed across multiple servers.

This is often called a shared-nothing architecture, because each machine works independently.
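One simple way to distribute work across machines is round-robin routing, sketched below. The class and server names are hypothetical, and real load balancers also deal with health checks, uneven request costs, and servers joining or leaving.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hands each incoming request to the next server in a fixed rotation."""

    def __init__(self, servers):
        if not servers:
            raise ValueError("at least one server is required")
        self._rotation = cycle(list(servers))

    def route(self, request):
        # The request content is ignored: every server can handle any request.
        return next(self._rotation)


balancer = RoundRobinBalancer(["server-a", "server-b", "server-c"])
assignments = [balancer.route(f"request-{i}") for i in range(6)]
print(assignments)
```

Scaling out in this model means constructing the rotation with more servers, which works well when any server can handle any request.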

Advantages

Can support very large systems.

Challenges

More operational complexity.

Hybrid Scaling in Real Systems

Most real systems combine both strategies.

For example:

a few powerful machines
combined with distributed clusters

This allows systems to handle both heavy workloads and large data volumes.

Elastic Scaling vs Manual Scaling

Scaling can happen automatically or manually.

Elastic Scaling

Infrastructure automatically adds or removes servers depending on traffic.

Common in cloud platforms.
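A simple proportional rule illustrates the idea: scale the server count by the ratio of observed load to target load, then clamp it to configured bounds. This is a sketch in the spirit of target-based autoscaling, not the algorithm of any specific cloud provider; the function name, parameters, and limits are assumptions.

```python
import math


def desired_servers(current_servers, observed_load_per_server,
                    target_load_per_server, min_servers=1, max_servers=100):
    """Return how many servers are needed to bring per-server load to target."""
    if target_load_per_server <= 0:
        raise ValueError("target_load_per_server must be positive")
    raw = math.ceil(
        current_servers * observed_load_per_server / target_load_per_server
    )
    # Never go below the floor or above the ceiling.
    return max(min_servers, min(max_servers, raw))


# Hypothetical numbers: load is measured in requests per second per server.
scale_out = desired_servers(4, observed_load_per_server=900, target_load_per_server=600)
scale_in = desired_servers(4, observed_load_per_server=150, target_load_per_server=600)
print(scale_out, scale_in)
```

With 4 servers each handling 900 requests per second against a target of 600, the rule asks for 6 servers; when load drops to 150 per server, it shrinks the fleet to 1.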

Manual Scaling

Engineers decide when to add servers.

Simpler but slower to respond to sudden traffic spikes.

Stateless vs Stateful Systems

Scaling also depends on whether a service is stateless or stateful.

Stateless Services

These services do not keep user or session data on the server between requests, so any instance can handle any request.

Examples:

API servers
web servers

They are easy to scale — just add more instances.

Stateful Systems

These store persistent data.

Examples:

databases
storage systems

Scaling them requires data partitioning or replication, which adds complexity.
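Hash-based partitioning is one common way to split data across machines. The sketch below maps a key to a partition by hashing it; the key format and partition count are hypothetical.

```python
import hashlib
from collections import Counter


def partition_for(key, num_partitions):
    """Map a string key to a partition index using a stable hash."""
    if num_partitions <= 0:
        raise ValueError("num_partitions must be positive")
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_partitions


# Hypothetical keys spread over 4 partitions.
keys = [f"user:{i}" for i in range(1000)]
distribution = Counter(partition_for(key, 4) for key in keys)
print(sorted(distribution.items()))
```

The same key always lands on the same partition, which is what makes lookups possible. The added complexity shows up when the partition count changes: with plain modulo hashing, most keys move to a different partition, so real systems often use schemes that limit how much data has to be moved.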

Final Thoughts

Scalability is not about predicting the future perfectly.
It is about designing systems that can grow when needed.

Good scalable systems start by understanding:

system load
usage patterns
performance metrics

There is no universal architecture that works everywhere.

Each system must be designed based on how users interact with it and how the workload behaves.

In the next part of this series, we’ll explore another critical property of good systems — Maintainability.
