<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Suman Prasad]]></title><description><![CDATA[This publication focuses on backend engineering, databases, system design, and concurrency, explaining complex computer science topics using real-world examples]]></description><link>https://blogs.sumanprasad.in</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1768984080964/9c059650-102f-4b4e-9bba-ad3aa539c7c7.png</url><title>Suman Prasad</title><link>https://blogs.sumanprasad.in</link></image><generator>RSS for Node</generator><lastBuildDate>Sat, 11 Apr 2026 01:44:07 GMT</lastBuildDate><atom:link href="https://blogs.sumanprasad.in/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Hypothetical Document Embeddings (HyDE): Smarter Retrieval in RAG]]></title><description><![CDATA[Most RAG systems work like this: Take user query → convert to embedding → search → generate answer
But here’s the issue: User queries are often too short, too vague, and missing context.
And because o]]></description><link>https://blogs.sumanprasad.in/hypothetical-document-embeddings-hyde-smarter-retrieval-in-rag</link><guid isPermaLink="true">https://blogs.sumanprasad.in/hypothetical-document-embeddings-hyde-smarter-retrieval-in-rag</guid><category><![CDATA[Artificial Intelligence]]></category><category><![CDATA[generative ai]]></category><category><![CDATA[llm]]></category><category><![CDATA[RAG ]]></category><category><![CDATA[Machine Learning]]></category><dc:creator><![CDATA[Suman Prasad]]></dc:creator><pubDate>Tue, 31 Mar 2026 14:54:29 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/643e1a46c689b269c0df875c/6241fdd7-2c38-4816-abee-8ab31232f99e.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Most RAG systems work like this: Take user query → convert to embedding → search → generate answer</p>
<p>But here’s the issue: User queries are often too short, too vague, and missing context.</p>
<p>And because of that, retrieval is not always accurate. So what if, instead of searching with a weak query, we first <strong>expand it into a rich document</strong>?</p>
<p>That’s exactly what <strong>HyDE (Hypothetical Document Embeddings)</strong> does.</p>
<h2>What is HyDE?</h2>
<p>HyDE is a retrieval technique where we:</p>
<ul>
<li><p>Generate a <strong>hypothetical document</strong> from the user query</p>
</li>
<li><p>Convert that document into embeddings</p>
</li>
<li><p>Use it to search for better context</p>
</li>
</ul>
<h2>Where does HyDE fit in RAG?</h2>
<p>A typical RAG pipeline:</p>
<ol>
<li><p><strong>Indexing</strong> → Store documents as embeddings</p>
</li>
<li><p><strong>Retrieval</strong> → Find relevant data</p>
</li>
<li><p><strong>Generation</strong> → Produce answer</p>
</li>
</ol>
<p>Here, HyDE improves the <strong>retrieval step</strong>.</p>
<p>Instead of Query → Search, we do Query → Generate document → Search.</p>
<h2>How HyDE Works (Step-by-Step)</h2>
<p>Step 1: Generate a Hypothetical Document</p>
<p>We use the LLM’s internal knowledge to expand the query into a plausible answer document.</p>
<p>Step 2: Convert to Embeddings</p>
<p>Step 3: Perform Semantic Search</p>
<p>Since the input is rich, retrieval becomes more aligned and more meaningful.</p>
<p>Step 4: Generate Final Response</p>
<p>Now the model answers using the original query plus the high-quality retrieved context.</p>
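<p>The four steps can be sketched as a minimal function. The <code>llm</code>, <code>embed</code>, and <code>search</code> callables here are hypothetical stand-ins for your chat model, embedding API, and vector store:</p>
<pre><code class="language-python">def hyde_retrieve(query, llm, embed, search, top_k=3):
    """HyDE: search with a generated document instead of the raw query."""
    # Step 1: expand the query into a hypothetical answer document
    hypothetical_doc = llm(f"Write a short passage answering: {query}")
    # Step 2: embed the rich document, not the weak query
    doc_embedding = embed(hypothetical_doc)
    # Step 3: semantic search; the caller feeds the results to generation
    return search(doc_embedding, top_k)
</code></pre>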
<h2>Why HyDE Works So Well?</h2>
<p>In normal RAG, the query is short, which means weak embeddings and average retrieval.</p>
<p>In HyDE, the generated document is rich, which means strong embeddings and better retrieval.</p>
<p>Essentially, we are expanding the query, adding the hidden context, and improving semantic matching.</p>
<h2>When Should You Use HyDE?</h2>
<p>Use HyDE when queries are too short or vague, when the domain is complex, or when retrieval quality is inconsistent.</p>
<h2>Final Thought</h2>
<p>RAG is not just about storing embeddings.</p>
<p>It’s about <strong>how you search</strong>.</p>
<p>HyDE shifts the thinking from:</p>
<p>“Search what user said” to “Search what user <em>meant.</em>”</p>
<hr />
<p>If you found this useful, I write simple blogs on:</p>
<p>GenAI Systems, backend engineering, system design</p>
<p>Follow along to catch more.</p>
]]></content:encoded></item><item><title><![CDATA[Reciprocal Rank Fusion: Making RAG Retrieval Smarter]]></title><description><![CDATA[Most RAG systems follow a simple idea:
Take the user query → search similar data → generate response
But here’s the problem: what if the user query is incomplete or ambiguous?
You might retrieve:

par]]></description><link>https://blogs.sumanprasad.in/reciprocal-rank-fusion-making-rag-retrieval-smarter</link><guid isPermaLink="true">https://blogs.sumanprasad.in/reciprocal-rank-fusion-making-rag-retrieval-smarter</guid><category><![CDATA[Artificial Intelligence]]></category><category><![CDATA[generative ai]]></category><category><![CDATA[llm]]></category><category><![CDATA[RAG ]]></category><category><![CDATA[Machine Learning]]></category><dc:creator><![CDATA[Suman Prasad]]></dc:creator><pubDate>Mon, 30 Mar 2026 17:05:58 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/643e1a46c689b269c0df875c/66dd1875-5b82-48c2-960b-d045e48d5264.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Most RAG systems follow a simple idea:</p>
<p>Take the user query → search similar data → generate response</p>
<p>But here’s the problem: what if the <strong>user query is incomplete or ambiguous?</strong></p>
<p>You might retrieve:</p>
<ul>
<li><p>partially relevant data</p>
</li>
<li><p>or completely miss important context</p>
</li>
</ul>
<p>This is where <strong>Reciprocal Rank Fusion (RRF)</strong> comes in.</p>
<h2>What is Reciprocal Rank Fusion?</h2>
<p>Reciprocal Rank Fusion is a retrieval technique that <strong>combines results from multiple queries and ranks them intelligently</strong>. Instead of relying on just one query, we:</p>
<ul>
<li><p>Generate multiple variations of the same query</p>
</li>
<li><p>Retrieve documents for each variation</p>
</li>
<li><p>Rank documents based on their importance across all queries</p>
</li>
</ul>
<p>If a document appears frequently across different queries and ranks higher, it is probably more relevant.</p>
<h2>Where does RRF fit in RAG?</h2>
<p>A typical RAG pipeline has three steps:</p>
<ol>
<li><p><strong>Indexing</strong> → Store data as embeddings</p>
</li>
<li><p><strong>Retrieval</strong> → Find relevant data</p>
</li>
<li><p><strong>Generation</strong> → Produce final answer</p>
</li>
</ol>
<p>RRF is applied in the <strong>retrieval phase.</strong> Instead of one query → one retrieval, we do multiple queries → multiple retrievals → ranked fusion</p>
<h2>How RRF Works?</h2>
<p>Step 1: Generate Query Variations</p>
<p>We take the original user query and create similar versions.</p>
<p>Step 2: Parallel Retrieval</p>
<p>Each query runs independently.</p>
<p>Step 3: Rank Documents (Core of RRF)</p>
<p>Instead of merging blindly, we <strong>score documents based on rank positions</strong>.</p>
<p>RRF Formula: Score = ∑ (1 / (k + rank))</p>
<p>rank = the document’s position in each result list</p>
<p>k = a constant (usually 60) that dampens the influence of any single top rank</p>
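<p>The scoring step can be sketched in a few lines of Python. This is a toy sketch; <code>ranked_lists</code> is assumed to be the output of the parallel retrievals:</p>
<pre><code class="language-python">def rrf_fuse(ranked_lists, k=60):
    """Reciprocal Rank Fusion: Score = sum of 1 / (k + rank) across lists."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc in enumerate(ranking, start=1):
            # documents ranked high across many lists accumulate the most score
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Three query variations produced slightly different rankings;
# "B" appears near the top everywhere, so it wins after fusion.
fused = rrf_fuse([["A", "B", "C"], ["B", "A", "D"], ["B", "C", "A"]])
</code></pre>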
<p>Step 4: Select Top Documents</p>
<p>Step 5: Generate Final Answer</p>
<img src="https://cdn.hashnode.com/uploads/covers/643e1a46c689b269c0df875c/e7a45ac7-de38-43fe-9b0b-b31ae4181c4e.png" alt="" style="display:block;margin:0 auto" />

<h2>Why RRF Improves Results?</h2>
<p>In normal RAG: One query → limited view → limited context</p>
<p>In RRF: Multiple perspectives → richer context → better answer</p>
<p>You are essentially:</p>
<ul>
<li><p>exploring different angles of the same question</p>
</li>
<li><p>merging the best information</p>
</li>
<li><p>prioritizing what matters most</p>
</li>
</ul>
<h2>Final Thought</h2>
<p>RAG is not just about embeddings.</p>
<p>It’s about <strong>how smart your retrieval is</strong>.</p>
<p>Techniques like these make retrieval smarter:</p>
<ul>
<li><p>Query decomposition (Chain of Thought)</p>
</li>
<li><p>Query expansion (RRF)</p>
</li>
</ul>
<hr />
<p>If you found this useful, I write simple blogs on:</p>
<p>GenAI Systems, backend engineering, system design</p>
<p>Follow along to catch more.</p>
]]></content:encoded></item><item><title><![CDATA[Chain of Thought in RAG: Making Queries Smarter, Not Harder]]></title><description><![CDATA[When building RAG systems, one common problem shows up quickly:
The user asks one big question... but the system struggles to retrieve the right context. Because most user queries are too abstract.
So]]></description><link>https://blogs.sumanprasad.in/chain-of-thought-in-rag-making-queries-smarter-not-harder</link><guid isPermaLink="true">https://blogs.sumanprasad.in/chain-of-thought-in-rag-making-queries-smarter-not-harder</guid><category><![CDATA[Artificial Intelligence]]></category><category><![CDATA[backend]]></category><category><![CDATA[llm]]></category><category><![CDATA[RAG ]]></category><category><![CDATA[generative ai]]></category><dc:creator><![CDATA[Suman Prasad]]></dc:creator><pubDate>Thu, 26 Mar 2026 17:08:52 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/643e1a46c689b269c0df875c/06d63c0b-3482-4a22-9f31-efc2272c3c74.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When building RAG systems, one common problem shows up quickly:</p>
<p>The user asks one big question... but the system struggles to retrieve the right context. Because most user queries are <strong>too abstract</strong>.</p>
<p>For example: “What are the steps involved in building a scalable and reliable distributed system?”</p>
<p>The system would perform much better if that query were broken into smaller, focused questions.</p>
<p>That's exactly where <strong>Chain of Thought (CoT)</strong> comes in.</p>
<h2>What is Chain of Thought (CoT)?</h2>
<p>Chain of Thought is a technique where a complex query is broken into smaller, logical steps, and each step is processed one after another.</p>
<p>Instead of solving everything in one go, the system:</p>
<ul>
<li><p>breaks the query</p>
</li>
<li><p>solves each part</p>
</li>
<li><p>uses previous results as context</p>
</li>
<li><p>gradually builds a better answer</p>
</li>
</ul>
<p>Instead of jumping to the final answer instantly, reason step by step.</p>
<h2>Why CoT Matters in RAG Systems</h2>
<p>A typical RAG pipeline has three main steps:</p>
<ol>
<li><p>Indexing → storing data as embeddings</p>
</li>
<li><p>Retrieval → fetching relevant information</p>
</li>
<li><p>Generation → producing the final answer</p>
</li>
</ol>
<p>Usually, generation is not the problem; retrieval is.</p>
<p>If the query is vague or broad, the retrieval step returns:</p>
<ul>
<li><p>weak context</p>
</li>
<li><p>irrelevant chunks</p>
</li>
<li><p>incomplete information</p>
</li>
</ul>
<p>And the final answer suffers.</p>
<p>This is the place where Chain of Thought helps.</p>
<h2>Where Does CoT Fit in RAG?</h2>
<p>CoT is applied in the retrieval stage.</p>
<p>Instead of sending one large query to the vector database, we:</p>
<ol>
<li><p>Break the query into sub-queries</p>
</li>
<li><p>Process them sequentially</p>
</li>
<li><p>Use previous outputs to improve the next retrieval</p>
</li>
</ol>
<p>So instead of:</p>
<img src="https://cdn.hashnode.com/uploads/covers/643e1a46c689b269c0df875c/8c10e802-5823-44a8-bd67-11886605d239.png" alt="" style="display:block;margin:0 auto" />

<p>We do:</p>
<img src="https://cdn.hashnode.com/uploads/covers/643e1a46c689b269c0df875c/b9ae9695-ad83-488e-8ced-6ac8e5b16f19.png" alt="" style="display:block;margin:0 auto" />

<h2>How Chain of Thought Works (Step by Step)</h2>
<h3>Step 1: Break the query</h3>
<p>Example: User Query → How does a scalable RAG system handle large traffic and ensure accurate responses?</p>
<p>The LLM breaks it into:</p>
<ol>
<li><p>What is scalability in RAG systems?</p>
</li>
<li><p>How does retrieval work in RAG?</p>
</li>
<li><p>How do we improve retrieval accuracy?</p>
</li>
</ol>
<h3>Step 2: Process First Sub-query</h3>
<p>Take the first sub-query</p>
<ul>
<li><p>generate embeddings</p>
</li>
<li><p>perform a semantic search</p>
</li>
<li><p>retrieve relevant chunks</p>
</li>
<li><p>generate a response</p>
</li>
</ul>
<h3>Step 3: Pass Context Forward</h3>
<p>The output of <strong>step 1</strong> <strong>becomes the context for step 2.</strong> So instead of starting fresh every time, the system builds knowledge step by step.</p>
<h3>Step 4: Repeat Sequentially</h3>
<p>Continue this process:</p>
<ul>
<li><p>Each sub-query uses previous responses</p>
</li>
<li><p>Context becomes richer</p>
</li>
<li><p>Retrieval becomes more precise</p>
</li>
</ul>
<h3>Step 5: Final Output</h3>
<p>The response generated from the last sub-query becomes the final answer.</p>
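<p>The sequential flow above can be sketched like this. The <code>llm</code>, <code>embed</code>, and <code>search</code> helpers are hypothetical stand-ins for your chat model, embedding API, and vector store:</p>
<pre><code class="language-python">def chained_retrieval(sub_queries, llm, embed, search):
    """Process sub-queries one by one, passing each answer forward."""
    context = ""
    answer = ""
    for sub_query in sub_queries:
        # each sub-query is searched together with what we learned so far
        chunks = search(embed(context + " " + sub_query))
        answer = llm(f"Context: {context}\nChunks: {chunks}\nQuestion: {sub_query}")
        context += " " + answer  # the answer becomes context for the next step
    return answer  # the response to the last sub-query is the final answer
</code></pre>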
<h2>Example With Code</h2>
<pre><code class="language-python">import os
import json
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()

# Groq exposes an OpenAI-compatible endpoint, so the OpenAI client works here
client = OpenAI(
    api_key=os.getenv("GROQ_API_KEY"),
    base_url="https://api.groq.com/openai/v1"
)

system_prompt = """
You are an AI assistant who is an expert in breaking down complex problems and then resolving the user query.
For the given user input, analyse the input and break down the problem step by step.
Think of at least 5-6 steps for how to solve the problem before solving it.
The steps are: you get a user input, you analyse, you think, you think again several times, you return an output with an explanation, and finally you validate the output before giving the final result.
Follow the steps in sequence: "analyse", "think", "output", "validate" and finally "result".

Rules:
1. Follow the strict JSON output as per the Output Format.
2. Always perform one step at a time and wait for the next input.
3. Carefully analyse the user query.

Output Format:
{ "step": "string", "content": "string" }

Example:
Input: What is 2 + 2?
Output: { "step": "analyse", "content": "Alright! The user is asking a basic arithmetic question." }
Output: { "step": "think", "content": "To perform the addition I must go from left to right and add all the operands." }
Output: { "step": "output", "content": "4" }
Output: { "step": "validate", "content": "Seems like 4 is the correct answer for 2 + 2." }
Output: { "step": "result", "content": "2 + 2 = 4, and that is calculated by adding all the numbers." }
"""

messages = [
    {"role": "system", "content": system_prompt},
]
query = input("&gt; ")
messages.append({"role": "user", "content": query})

# keep asking the model for the next reasoning step until it produces "output"
while True:
    response = client.chat.completions.create(
        model="llama-3.1-8b-instant",
        response_format={"type": "json_object"},
        messages=messages
    )
    parsed_response = json.loads(response.choices[0].message.content)
    messages.append({"role": "assistant", "content": json.dumps(parsed_response)})
    if parsed_response.get("step") != "output":
        print(f"{parsed_response.get('step')}: {parsed_response.get('content')}")
        continue
    print(f"output: {parsed_response.get('content')}")
    break
</code></pre>
<pre><code class="language-python">&gt; what is 2 + 10 * 5
analyse: Alright! The user is interested in maths query and has a query with a mix of addition and multiplication operation with a BODMAS(Brackets,Order, Division, Multiplication, Addition, Subtraction) application required, where the first non bracketed operator is multiplication.
think: To solve this, I need to follow the BODMAS rule and perform operations from left to right. First, I will perform the multiplication and then add the result to 2. I need to consider 10 being multiplied by 5 to get the correct intermediate result before moving to next operation which is addition with 2.
think: First, I must consider multiplication. 10 * 5 = 50, which is the result of the first operation in the given expression. Now, I proceed to the addition of result 50 and 2.
output: 52
</code></pre>
<h2>Why This Improves Accuracy</h2>
<p>It reduces abstraction. Instead of asking one vague question, it focuses on smaller parts, retrieves focused information, and builds a layered context, which leads to better semantic search, more relevant chunks, and higher-quality responses.</p>
<h2>When Should You Use CoT in RAG?</h2>
<p>Chain of Thought is useful when queries are complex, questions involve multiple steps, context needs to be built gradually, and retrieval quality is poor.</p>
<h2>Final Thoughts</h2>
<p>RAG systems don't fail because generation is weak. They fail because retrieval is shallow.</p>
<p>Chain of Thought fixes this by:</p>
<ul>
<li><p>breaking queries into meaningful steps</p>
</li>
<li><p>increasing context depth</p>
</li>
<li><p>improving semantic retrieval</p>
</li>
</ul>
<hr />
<p>If you found this useful, I write simple blogs on:</p>
<p>GenAI Systems, backend engineering, system design</p>
<p>Follow along to catch more.</p>
]]></content:encoded></item><item><title><![CDATA[Understanding Maintainability in System Design]]></title><description><![CDATA[When people talk about system design, the conversation usually focuses on scalability or reliability.
But in reality, many systems don’t fail because they cannot scale.They fail because they become to]]></description><link>https://blogs.sumanprasad.in/understanding-maintainability-in-system-design</link><guid isPermaLink="true">https://blogs.sumanprasad.in/understanding-maintainability-in-system-design</guid><category><![CDATA[System Design]]></category><category><![CDATA[software development]]></category><category><![CDATA[backend]]></category><category><![CDATA[software architecture]]></category><category><![CDATA[software]]></category><dc:creator><![CDATA[Suman Prasad]]></dc:creator><pubDate>Sun, 01 Mar 2026 04:39:42 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/643e1a46c689b269c0df875c/0f034284-4bc5-4df9-80f5-1c86a5d91b1e.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When people talk about system design, the conversation usually focuses on <strong>scalability</strong> or <strong>reliability</strong>.</p>
<p>But in reality, many systems don’t fail because they cannot scale.<br />They fail because they become <strong>too painful to maintain</strong>.</p>
<p>A system that is hard to understand, modify, or operate will eventually slow down development and create constant operational problems.</p>
<p>This is why <strong>maintainability</strong> is a critical property of well-designed systems.</p>
<h2>Why Maintainability Matters</h2>
<p>Most of the cost of software does not come from writing the first version.</p>
<p>It comes from <strong>maintaining it over time</strong>.</p>
<p>Maintenance includes many things:</p>
<ul>
<li><p>fixing bugs</p>
</li>
<li><p>adding new features</p>
</li>
<li><p>upgrading dependencies</p>
</li>
<li><p>improving performance</p>
</li>
<li><p>adapting to new requirements</p>
</li>
<li><p>keeping the system running smoothly</p>
</li>
</ul>
<p>In many organizations, engineers spend <strong>far more time maintaining systems than building them from scratch</strong>.</p>
<p>This is why a system that is easy to maintain can significantly improve long-term productivity.</p>
<h2>The Legacy System Problem</h2>
<p>Many developers have experienced working on what people often call a <strong>legacy system</strong>.</p>
<p>These systems usually have characteristics like:</p>
<ul>
<li><p>complicated code structure</p>
</li>
<li><p>poor documentation</p>
</li>
<li><p>unclear design decisions</p>
</li>
<li><p>tightly coupled components</p>
</li>
</ul>
<p>Because of this, even small changes become risky.</p>
<p>Developers often hesitate to modify such systems because a simple update might break something unexpected.</p>
<p>The goal of good system design is to <strong>avoid creating tomorrow’s legacy system today</strong>.</p>
<h2>Three Principles of Maintainable Systems</h2>
<p>Maintainable systems usually share three key qualities:</p>
<ol>
<li><p><strong>Operability</strong></p>
</li>
<li><p><strong>Simplicity</strong></p>
</li>
<li><p><strong>Evolvability</strong></p>
</li>
</ol>
<p>Together, these principles make systems easier to run, understand, and modify.</p>
<h2>Operability: Making Systems Easy to Run</h2>
<p>Software does not run itself.</p>
<p>Operations teams (or DevOps engineers) are responsible for keeping systems healthy and running smoothly.</p>
<p>Their tasks include:</p>
<ul>
<li><p>monitoring system health</p>
</li>
<li><p>diagnosing failures</p>
</li>
<li><p>managing deployments</p>
</li>
<li><p>applying security patches</p>
</li>
<li><p>planning infrastructure capacity</p>
</li>
<li><p>handling configuration changes</p>
</li>
</ul>
<p>If a system is difficult to operate, even small issues can take hours to diagnose.</p>
<p>Good systems, therefore, aim to make the <strong>daily life of operators easier</strong>.</p>
<h2>How Systems Improve Operability</h2>
<p>A well-designed system usually provides several operational capabilities.</p>
<h3>Monitoring and Metrics</h3>
<p>Operators need visibility into system behavior.</p>
<p>Typical metrics include:</p>
<ul>
<li><p>error rates</p>
</li>
<li><p>request latency</p>
</li>
<li><p>CPU and memory usage</p>
</li>
<li><p>traffic patterns</p>
</li>
</ul>
<p>Monitoring allows teams to detect problems early before users are affected.</p>
<h3>Automation Support</h3>
<p>Routine operational tasks should be automated.</p>
<p>Examples include:</p>
<ul>
<li><p>deployments</p>
</li>
<li><p>scaling infrastructure</p>
</li>
<li><p>backups</p>
</li>
<li><p>system recovery</p>
</li>
</ul>
<p>Automation reduces human error and improves consistency.</p>
<h3>Safe Maintenance</h3>
<p>Sometimes operators must take machines offline for maintenance.</p>
<p>A well-designed system should allow this without affecting users.</p>
<p>This often requires <strong>load balancing and redundancy</strong> so traffic can be redirected.</p>
<h3>Good Documentation</h3>
<p>Clear documentation helps teams understand:</p>
<ul>
<li><p>How the system works</p>
</li>
<li><p>How to troubleshoot problems</p>
</li>
<li><p>How to deploy new versions</p>
</li>
</ul>
<p>Without documentation, even simple operational tasks become difficult.</p>
<h2>Simplicity: Managing Complexity</h2>
<p>As systems grow, complexity naturally increases.</p>
<p>More services are added.<br />More dependencies appear.<br />More interactions occur between components.</p>
<p>If complexity is not controlled, the system eventually becomes very difficult to understand.</p>
<p>Engineers sometimes describe such systems as a <strong>"big ball of mud"</strong>.</p>
<h2>Signs of a Complex System</h2>
<p>A system suffering from excessive complexity often shows patterns like:</p>
<ul>
<li><p>tightly coupled modules</p>
</li>
<li><p>tangled dependencies</p>
</li>
<li><p>inconsistent naming</p>
</li>
<li><p>excessive special cases</p>
</li>
<li><p>hidden assumptions in code</p>
</li>
</ul>
<p>These issues make systems fragile and slow down development.</p>
<h2>Essential vs Accidental Complexity</h2>
<p>Not all complexity is bad.</p>
<p>There are two types of complexity in software systems.</p>
<h3>Essential Complexity</h3>
<p>This comes from the actual problem the system is trying to solve.</p>
<p>For example:</p>
<ul>
<li><p>handling financial transactions</p>
</li>
<li><p>managing user authentication</p>
</li>
<li><p>processing large datasets</p>
</li>
</ul>
<p>This complexity cannot be removed.</p>
<h3>Accidental Complexity</h3>
<p>This comes from poor design decisions.</p>
<p>Examples include:</p>
<ul>
<li><p>unnecessary abstractions</p>
</li>
<li><p>confusing APIs</p>
</li>
<li><p>poorly structured code</p>
</li>
<li><p>duplicated logic</p>
</li>
</ul>
<p>The goal of good system design is to <strong>reduce accidental complexity as much as possible</strong>.</p>
<h2>Abstraction: The Key Tool for Simplicity</h2>
<p>One of the most powerful tools for managing complexity is <strong>abstraction</strong>.</p>
<p>Abstraction hides internal implementation details and exposes a simpler interface.</p>
<p>We see this concept everywhere in software.</p>
<p>Examples include:</p>
<ul>
<li><p>Programming languages hiding machine instructions</p>
</li>
<li><p>SQL hiding low-level storage details</p>
</li>
<li><p>APIs hiding internal service logic</p>
</li>
</ul>
<p>By hiding complexity, abstraction makes systems easier to understand and maintain.</p>
<p>However, designing good abstractions requires careful thinking.</p>
<p>Poor abstractions can sometimes create even more complexity.</p>
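<p>As a tiny illustration (a hypothetical class, not tied to any real library), callers of this store see only <code>save</code> and <code>load</code>; the serialization and storage choices stay hidden and can change later without breaking them:</p>
<pre><code class="language-python">import json

class KeyValueStore:
    """Callers see only save/load; the storage details are hidden behind them."""
    def __init__(self):
        self._data = {}                      # could become a file, Redis, SQL...

    def save(self, key, value):
        self._data[key] = json.dumps(value)  # serialization is an internal detail

    def load(self, key):
        return json.loads(self._data[key])
</code></pre>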
<h2>Evolvability: Designing Systems That Can Change</h2>
<p>One thing is certain in software development.</p>
<p><strong>Requirements will change.</strong></p>
<p>Over time, systems must adapt to:</p>
<ul>
<li><p>new user needs</p>
</li>
<li><p>evolving business goals</p>
</li>
<li><p>growing datasets</p>
</li>
<li><p>new technologies</p>
</li>
<li><p>regulatory requirements</p>
</li>
</ul>
<p>A system that cannot adapt easily becomes obsolete.</p>
<p>This ability to adapt is called <strong>evolvability</strong>.</p>
<h2>Code-Level vs System-Level Change</h2>
<p>Many development practices improve changeability at the code level.</p>
<p>Examples include:</p>
<ul>
<li><p>refactoring</p>
</li>
<li><p>automated testing</p>
</li>
<li><p>test-driven development</p>
</li>
</ul>
<p>These techniques make it easier to modify small pieces of code safely.</p>
<p>However, large systems also need to evolve at the <strong>architectural level</strong>.</p>
<p>For example, a company might redesign how data is stored or how services communicate.</p>
<p>Such changes require systems that are <strong>simple and modular enough to evolve</strong>.</p>
<hr />
<h2>Why Simplicity Enables Evolvability</h2>
<p>Simplicity and evolvability are closely connected.</p>
<p>Simple systems are easier to:</p>
<ul>
<li><p>understand</p>
</li>
<li><p>modify</p>
</li>
<li><p>extend</p>
</li>
<li><p>debug</p>
</li>
</ul>
<p>Complex systems make changes risky because engineers cannot easily predict the consequences.</p>
<p>This is why simplicity is one of the strongest foundations for long-term maintainability.</p>
<h2>Final Thoughts</h2>
<p>Building software is only the beginning of a system’s lifecycle.</p>
<p>The real challenge lies in <strong>running and evolving that system over time</strong>.</p>
<p>Maintainable systems are built with three goals in mind:</p>
<ul>
<li><p>making operations manageable</p>
</li>
<li><p>keeping system design simple</p>
</li>
<li><p>enabling future changes</p>
</li>
</ul>
<p>When systems achieve these qualities, they remain productive and adaptable even as requirements grow and technology evolves.</p>
<h2>Series Summary</h2>
<p>This concludes the three-part series on important properties of well-designed systems:</p>
<p>1️⃣ <a href="https://blogs.sumanprasad.in/why-systems-fail-and-how-reliable-systems-survive"><strong>Reliability</strong></a> — systems that continue working despite faults<br />2️⃣ <a href="https://blogs.sumanprasad.in/understanding-scalability-in-system-design"><strong>Scalability</strong></a> — systems that handle growing demand<br />3️⃣ <strong>Maintainability</strong> — systems that remain easy to operate and evolve</p>
<p>Together, these principles form the foundation of strong system design.</p>
]]></content:encoded></item><item><title><![CDATA[Understanding Scalability in System Design]]></title><description><![CDATA[Modern systems rarely fail because they are badly written.
More often, they fail because they cannot handle growth.
A system that works perfectly with 1,000 users may completely collapse when the user]]></description><link>https://blogs.sumanprasad.in/understanding-scalability-in-system-design</link><guid isPermaLink="true">https://blogs.sumanprasad.in/understanding-scalability-in-system-design</guid><category><![CDATA[System Design]]></category><category><![CDATA[Backend Engineering]]></category><category><![CDATA[distributed systems]]></category><category><![CDATA[scalability]]></category><dc:creator><![CDATA[Suman Prasad]]></dc:creator><pubDate>Sat, 21 Feb 2026 16:13:17 GMT</pubDate><enclosure url="https://cloudmate-test.s3.us-east-1.amazonaws.com/uploads/covers/643e1a46c689b269c0df875c/58c573f6-3e3a-42c5-a030-895f2e45ebc2.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Modern systems rarely fail because they are badly written.</p>
<p>More often, they fail because they <strong>cannot handle growth.</strong></p>
<p>A system that works perfectly with 1,000 users may completely collapse when the user base reaches 1 million. This is why <strong>scalability</strong> becomes one of the most important topics in system design.</p>
<p>In this article, we'll understand what scalability means, how engineers measure system load, and the different ways systems grow to handle increasing demand.</p>
<h2>What is Scalability?</h2>
<p>In simple terms, <strong>scalability is the ability of a system to handle growth.</strong></p>
<p>Growth can happen in different ways:</p>
<ul>
<li><p>More users are joining the system.</p>
</li>
<li><p>More requests are being sent to the system.</p>
</li>
<li><p>More data is being stored.</p>
</li>
<li><p>Higher traffic during peak hours.</p>
</li>
</ul>
<p>A scalable system should be able to <strong>continue performing well even when demand increases.</strong></p>
<p>For example:</p>
<p>Imagine an e-commerce platform during a festive sale.</p>
<p>If the number of users suddenly increases from <strong>10,000 to 1 million</strong>, the system should still:</p>
<ul>
<li><p>process orders</p>
</li>
<li><p>update inventory</p>
</li>
<li><p>show product pages quickly</p>
</li>
</ul>
<p>If it cannot handle this increase, the system is <strong>not scalable.</strong></p>
<h2>Understanding System Load</h2>
<p>Before we talk about scaling, we must first understand <strong>what kind of load the system is handling.</strong></p>
<p>Different systems measure load in different ways.</p>
<p>Some common load parameters include:</p>
<ul>
<li><p>Requests per second</p>
</li>
<li><p>Number of active users</p>
</li>
<li><p>Read vs write operations</p>
</li>
<li><p>Cache hit rate</p>
</li>
<li><p>Amount of stored data</p>
</li>
</ul>
<p>These metrics help engineers <strong>understand where the system is under pressure.</strong></p>
<p>For example:</p>
<ul>
<li><p>A video streaming platform may focus on <strong>bandwidth and concurrent users</strong></p>
</li>
<li><p>A messaging app may care about <strong>messages per second</strong></p>
</li>
<li><p>A social network may track <strong>timeline requests per second</strong></p>
</li>
</ul>
<p>Understanding the correct metric is the <strong>first step toward designing scalable systems.</strong></p>
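<p>As a small illustration, “requests per second” can be measured with a sliding window. This is a toy sketch, not a production monitoring tool:</p>
<pre><code class="language-python">import time
from collections import deque

class RequestRateTracker:
    """Count how many requests arrived within the last `window` seconds."""
    def __init__(self, window=1.0):
        self.window = window
        self.timestamps = deque()

    def record(self, now=None):
        now = time.monotonic() if now is None else now
        self.timestamps.append(now)
        self._evict(now)

    def rate(self, now=None):
        now = time.monotonic() if now is None else now
        self._evict(now)
        return len(self.timestamps)

    def _evict(self, now):
        # drop timestamps that have fallen out of the window
        while self.timestamps and now - self.timestamps[0] > self.window:
            self.timestamps.popleft()
</code></pre>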
<h2>Real-World Example: Twitter (Now X) Timeline Problem</h2>
<p>One of the most famous scalability challenges comes from social media platforms like Twitter.</p>
<p>Two common operations happen on such systems:</p>
<ul>
<li><p>Posting a tweet</p>
</li>
<li><p>Viewing a user's timeline</p>
</li>
</ul>
<p>At first glance, posting tweets seems simple. But the real challenge lies in <strong>distributing that tweet to millions of followers</strong>.</p>
<p>Let’s look at two different ways to design the timeline system.</p>
<h2>Approach 1: Compute Timeline When User Reads</h2>
<p>In this design, the system calculates the timeline <strong>only when the user opens it</strong>.</p>
<p>Steps:</p>
<ol>
<li><p>Find all users the current user follows</p>
</li>
<li><p>Fetch their recent tweets</p>
</li>
<li><p>Merge and sort them</p>
</li>
</ol>
<p>Example query:</p>
<pre><code class="language-sql">SELECT tweets.*, users.*
FROM tweets
JOIN users ON tweets.sender_id = users.id
JOIN follows ON follows.followee_id = users.id
WHERE follows.follower_id = current_user
</code></pre>
<h3>Advantage</h3>
<p>Writes are cheap because the system only stores tweets once.</p>
<h3>Problem</h3>
<p>Reads become very expensive.</p>
<p>If millions of users open their timelines at the same time, the system must run <strong>millions of complex queries</strong>.</p>
<p>This approach struggles when the <strong>read traffic is extremely high</strong>.</p>
<h2>Approach 2: Precompute Timeline When Tweet is Posted</h2>
<p>Another design flips the logic.</p>
<p>Instead of computing timelines when users read them, the system prepares the timeline <strong>when a tweet is created</strong>.</p>
<p>Steps:</p>
<ol>
<li><p>User posts a tweet</p>
</li>
<li><p>The system copies that tweet into the timeline of each follower</p>
</li>
</ol>
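<p>The fan-out-on-write steps above can be sketched in a few lines of Python. This is only an illustration: the in-memory <code>followers</code> and <code>timelines</code> stores stand in for real database tables.</p>
<pre><code class="language-python">from collections import defaultdict

# Hypothetical in-memory stores standing in for real database tables
followers = defaultdict(set)   # author_id maps to the set of follower ids
timelines = defaultdict(list)  # user_id maps to their materialized timeline

def follow(follower_id, followee_id):
    followers[followee_id].add(follower_id)

def post_tweet(author_id, text):
    # Fan-out on write: copy the tweet into every follower's timeline
    tweet = {"author": author_id, "text": text}
    for follower_id in followers[author_id]:
        timelines[follower_id].insert(0, tweet)

def read_timeline(user_id):
    # Reads are cheap: the timeline is already materialized
    return timelines[user_id]
</code></pre>
<p>Notice where the cost moved: <code>post_tweet</code> now does one write per follower, while <code>read_timeline</code> is a simple lookup.</p>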
<p>Now when a user opens their timeline, the data is <strong>already prepared</strong>.</p>
<h3>Advantage</h3>
<p>Reading timelines becomes extremely fast.</p>
<h3>Problem</h3>
<p>Writes become expensive.</p>
<p>Imagine:</p>
<ul>
<li><p>Average user has <strong>75 followers</strong></p>
</li>
<li><p>If <strong>4,000 tweets are posted per second</strong></p>
</li>
</ul>
<p>The system now performs:</p>
<pre><code class="language-bash">4,000 × 75 = 300,000 writes per second
</code></pre>
<p>For celebrities with <strong>millions of followers</strong>, a single tweet could generate <strong>millions of database writes</strong>.</p>
<h2>Hybrid Design Used in Practice</h2>
<p>Real systems rarely use just one approach.</p>
<p>Instead, they combine both.</p>
<p>Typical strategy:</p>
<ul>
<li><p><strong>Normal users → Fan-out on write</strong></p>
</li>
<li><p><strong>Celebrities → Compute on read</strong></p>
</li>
</ul>
<p>This reduces the load caused by huge follower counts.</p>
<p>The key takeaway here is:</p>
<p><strong>Scalability solutions depend heavily on usage patterns.</strong></p>
<p>There is no universal design that works for every system.</p>
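<p>The hybrid read path can be sketched roughly like this. The helpers and data shapes here are hypothetical, but the idea is real: the precomputed timeline (fan-out on write for normal users) is merged with celebrity tweets fetched at read time.</p>
<pre><code class="language-python">def build_timeline(user, precomputed, celebrity_feeds):
    """Merge the user's precomputed timeline with tweets from
    followed celebrities, which are fetched at read time."""
    merged = list(precomputed)
    for celeb in user["followed_celebrities"]:
        merged.extend(celebrity_feeds.get(celeb, []))
    # Sort newest first by timestamp
    return sorted(merged, key=lambda t: t["ts"], reverse=True)
</code></pre>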
<h2>Measuring System Performance</h2>
<blockquote>
<p>When traffic increases, engineers usually ask two questions.</p>
<h3>Question 1</h3>
<p>If load increases but resources stay the same,<br /><strong>how does system performance change?</strong></p>
<h3>Question 2</h3>
<p>If load increases,<br /><strong>how many additional resources are needed?</strong></p>
<p>To answer these questions, we measure system performance using two main metrics.</p>
</blockquote>
<h2>Throughput</h2>
<p>Throughput measures <strong>how much work the system can process</strong>.</p>
<p>Example:</p>
<ul>
<li><p>records processed per second</p>
</li>
<li><p>tasks completed per minute</p>
</li>
</ul>
<p>Throughput is commonly used in <strong>batch processing systems</strong> like data pipelines.</p>
<h2>Response Time</h2>
<p>Response time measures <strong>how long a user waits for a response</strong>.</p>
<p>This includes:</p>
<ul>
<li><p>processing time</p>
</li>
<li><p>network delay</p>
</li>
<li><p>waiting in queues</p>
</li>
</ul>
<p>In most web systems, response time is the <strong>most important user-facing metric</strong>.</p>
<h2>Latency vs Response Time</h2>
<p>People often mix these terms, but they are slightly different.</p>
<p><strong>Latency</strong></p>
<p>The time a request waits before processing starts.</p>
<p><strong>Response Time</strong></p>
<p>Total time from request to response.</p>
<pre><code class="language-bash">Response Time = Latency + Processing Time + Network Delay
</code></pre>
<p>Users care about <strong>response time</strong>, because that represents how long they actually wait.</p>
<h2>Why Average Response Time is Misleading</h2>
<p>Many engineers make the mistake of measuring <strong>average response time</strong>.</p>
<p>But averages hide slow requests.</p>
<p>Example:</p>
<p>If most requests take <strong>100 ms</strong> but a few take <strong>5 seconds</strong>, the average may still look fine.</p>
<p>However, those slow requests create a <strong>bad user experience</strong>.</p>
<p>This is why engineers rely on <strong>percentiles</strong>.</p>
<h2>Understanding Percentiles</h2>
<p>Percentiles show how slow the worst requests are.</p>
<p>Common metrics include:</p>
<table style="min-width:50px"><colgroup><col style="min-width:25px"></col><col style="min-width:25px"></col></colgroup><tbody><tr><td><p>Percentile</p></td><td><p>Meaning</p></td></tr><tr><td><p>50th</p></td><td><p>Median response time</p></td></tr><tr><td><p>95th</p></td><td><p>Slow requests</p></td></tr><tr><td><p>99th</p></td><td><p>Very slow edge cases</p></td></tr></tbody></table>

<p>Large tech companies often monitor the <strong>99th percentile latency</strong> to ensure even rare slow requests are under control.</p>
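<p>A tiny Python sketch makes the contrast concrete. With 98 fast requests and 2 very slow ones, the average looks acceptable while the 99th percentile exposes the slow tail (the nearest-rank method used here is one of several ways to compute percentiles):</p>
<pre><code class="language-python">def percentile(latencies_ms, p):
    """Return the p-th percentile using the nearest-rank method."""
    ordered = sorted(latencies_ms)
    rank = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[rank]

# 98 fast requests and 2 very slow ones
samples = [100] * 98 + [5000, 5000]
mean = sum(samples) / len(samples)  # 198 ms: looks fine
p50 = percentile(samples, 50)       # 100 ms
p99 = percentile(samples, 99)       # 5000 ms: the tail the average hides
</code></pre>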
<h2>The Tail Latency Problem</h2>
<p>Modern systems often depend on multiple services.</p>
<p>Example:</p>
<p>A single request may involve:</p>
<ul>
<li><p>authentication service</p>
</li>
<li><p>recommendation engine</p>
</li>
<li><p>database</p>
</li>
<li><p>payment service</p>
</li>
</ul>
<p>The overall response must wait for <strong>the slowest service</strong>.</p>
<p>This problem is known as <strong>tail latency amplification</strong>.</p>
<p>Even if most services are fast, one slow component can delay the entire request.</p>
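<p>A little arithmetic shows why this amplification happens. If each backend call is fast 99% of the time, a request that fans out to several backends is much more likely to hit at least one slow call:</p>
<pre><code class="language-python">def prob_hits_slow_call(num_services, p_fast=0.99):
    """Probability that a request touching num_services backends
    experiences at least one call slower than each service's p99."""
    return 1 - p_fast ** num_services

# With 5 backends, roughly 5% of requests hit a p99-slow call;
# with 100 backend calls, roughly 63% of requests do.
</code></pre>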
<h2>Methods to Handle Growing Load</h2>
<p>Once engineers understand the load, they decide <strong>how to scale the system</strong>.</p>
<p>There are two common approaches.</p>
<h3>Vertical Scaling (Scale Up)</h3>
<p>This means upgrading a machine with more resources.</p>
<p>Example:</p>
<ul>
<li><p>more CPU</p>
</li>
<li><p>more RAM</p>
</li>
<li><p>faster disks</p>
</li>
</ul>
<p><strong>Advantages</strong></p>
<p>Simple to implement.</p>
<p><strong>Limitations</strong></p>
<p>Machines cannot grow infinitely.<br />Eventually, hardware upgrades become extremely expensive.</p>
<h3>Horizontal Scaling (Scale Out)</h3>
<p>Instead of upgrading one machine, the system <strong>adds more machines</strong>.</p>
<p>The workload is distributed across multiple servers.</p>
<p>This architecture is often called <strong>shared-nothing architecture</strong>, because each machine works independently.</p>
<p><strong>Advantages</strong></p>
<p>Can support very large systems.</p>
<p><strong>Challenges</strong></p>
<p>More operational complexity.</p>
<h3>Hybrid Scaling in Real Systems</h3>
<p>Most real systems combine both strategies.</p>
<p>For example:</p>
<ul>
<li><p>a few powerful machines</p>
</li>
<li><p>combined with distributed clusters</p>
</li>
</ul>
<p>This allows systems to handle <strong>both heavy workloads and large data volumes</strong>.</p>
<h3>Elastic Scaling vs Manual Scaling</h3>
<p>Scaling can happen automatically or manually.</p>
<p><strong>Elastic Scaling</strong></p>
<p>Infrastructure automatically adds or removes servers depending on traffic.</p>
<p>Common in cloud platforms.</p>
<p><strong>Manual Scaling</strong></p>
<p>Engineers decide when to add servers.</p>
<p>Simpler but slower to respond to sudden traffic spikes.</p>
<h3>Stateless vs Stateful Systems</h3>
<p>Scaling also depends on whether a service is <strong>stateless or stateful</strong>.</p>
<p><strong>Stateless Services</strong></p>
<p>These services do not store user data locally.</p>
<p>Examples:</p>
<ul>
<li><p>API servers</p>
</li>
<li><p>web servers</p>
</li>
</ul>
<p>They are easy to scale — just add more instances.</p>
<p><strong>Stateful Systems</strong></p>
<p>These store persistent data.</p>
<p>Examples:</p>
<ul>
<li><p>databases</p>
</li>
<li><p>storage systems</p>
</li>
</ul>
<p>Scaling them requires <strong>data partitioning or replication</strong>, which adds complexity.</p>
<h2>Final Thoughts</h2>
<p>Scalability is not about predicting the future perfectly.<br />It is about <strong>designing systems that can grow when needed</strong>.</p>
<p>Good scalable systems start by understanding:</p>
<ul>
<li><p>system load</p>
</li>
<li><p>usage patterns</p>
</li>
<li><p>performance metrics</p>
</li>
</ul>
<p>There is no universal architecture that works everywhere.</p>
<p>Each system must be designed based on <strong>how users interact with it and how the workload behaves</strong>.</p>
<p>In the next part of this series, we’ll explore another critical property of good systems — <strong>Maintainability</strong>.</p>
]]></content:encoded></item><item><title><![CDATA[Why Systems Fail - And How Reliable Systems Survive]]></title><description><![CDATA[When a system goes down, users don’t care whether it was a server crash, a bug, or a configuration mistake.
All they see is one thing:
“The app is not working.”
That moment when users can’t use the system is what truly matters. And preventing that mo...]]></description><link>https://blogs.sumanprasad.in/why-systems-fail-and-how-reliable-systems-survive</link><guid isPermaLink="true">https://blogs.sumanprasad.in/why-systems-fail-and-how-reliable-systems-survive</guid><category><![CDATA[System Design]]></category><category><![CDATA[backend]]></category><category><![CDATA[software development]]></category><category><![CDATA[Databases]]></category><category><![CDATA[distributed system]]></category><dc:creator><![CDATA[Suman Prasad]]></dc:creator><pubDate>Wed, 11 Feb 2026 17:03:01 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1770829109741/4f3a12d5-aa62-41fb-b609-7e0f576b2fa3.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When a system goes down, users don’t care whether it was a server crash, a bug, or a configuration mistake.</p>
<p>All they see is one thing:</p>
<p>“The app is not working.”</p>
<p>That moment when users can’t use the system is what truly matters. And preventing that moment is what reliability is all about.</p>
<p>Reliability is not about building systems that never break.</p>
<p>That’s impossible.</p>
<p>Reliability is about building systems that <strong>continue to work even when parts of them fail.</strong></p>
<h2 id="heading-reliability-is-about-trust">Reliability Is About Trust</h2>
<p>Every system, big or small, makes a promise to its users.</p>
<ul>
<li><p>A payment app promises that money will move safely.</p>
</li>
<li><p>A photo app promises that memories won’t disappear.</p>
</li>
<li><p>A business tool promises that work won’t be lost.</p>
</li>
</ul>
<p>When that promise breaks, users don’t just get frustrated - they lose trust.</p>
<p>And trust, once lost, is very hard to win back.</p>
<p>That’s why reliability matters far beyond critical systems like airplanes or hospitals.<br />It matters just as much for everyday products.</p>
<h2 id="heading-faults-vs-failures-a-small-difference-that-matters">Faults vs Failures - A Small Difference That Matters</h2>
<p>Inside any system, problems are constantly happening.</p>
<p>A disk might stop working.<br />A server might crash.<br />A network might slow down.<br />A bug might get triggered.</p>
<p>These are faults.</p>
<p>But a fault is not the same as a failure.</p>
<p>A failure is when the user feels the impact.<br />When the app stops responding.<br />When data becomes unavailable.<br />When something breaks from the user’s point of view.</p>
<p>Good systems accept that faults will happen.<br />Their goal is simple: Don’t let internal problems become user-visible failures.</p>
<h2 id="heading-you-cant-avoid-problems-you-can-only-prepare-for-them">You Can’t Avoid Problems — You Can Only Prepare for Them</h2>
<p>No matter how carefully you design a system, things will go wrong.</p>
<p>Hardware wears out.<br />Software has bugs.<br />Humans make mistakes.</p>
<p>So the real question is not:</p>
<p>“How do we stop failures from ever happening?”</p>
<p>The real question is:</p>
<p>“How do we make sure the system survives when they do?”</p>
<p>This mindset leads to something called fault tolerance.</p>
<p>A fault-tolerant system expects trouble. It detects issues early. It recovers quickly. And it keeps serving users.</p>
<p>Some companies even go a step further. They intentionally break their own systems to test them.</p>
<p>Why?<br />Because if your system only works in perfect conditions, it’s not reliable.</p>
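<p>One small example of this mindset in code is retrying a flaky operation instead of failing on the first fault. This is a minimal sketch, with hypothetical retry limits; real systems add jitter, timeouts, and circuit breakers.</p>
<pre><code class="language-python">import time

def call_with_retries(operation, max_attempts=3, base_delay=0.01):
    """Retry a flaky operation with exponential backoff so that a
    transient fault does not become a user-visible failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of retries: the fault surfaces as a failure
            time.sleep(base_delay * 2 ** (attempt - 1))
</code></pre>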
<h2 id="heading-hardware-problems-are-normal">Hardware Problems Are Normal</h2>
<p>In large systems, hardware failure is not rare — it’s routine.</p>
<p>Disks fail.<br />Machines lose power.<br />Network connections drop.</p>
<p>In environments with thousands of machines, something breaks almost every day.</p>
<p>Modern systems don’t try to make each machine perfect.<br />Instead, they assume: Some machines will fail — design around that reality.</p>
<p>That’s why systems use:</p>
<ul>
<li><p>Multiple servers</p>
</li>
<li><p>Redundant storage</p>
</li>
<li><p>Backup power</p>
</li>
<li><p>Replicated data</p>
</li>
</ul>
<p>The goal is simple: if one part stops working, another part takes over.</p>
<h2 id="heading-software-errors-are-harder-to-predict">Software Errors Are Harder to Predict</h2>
<p>Hardware problems are random.</p>
<p>Software problems are different.</p>
<p>A single hidden bug can affect every server at the same time.</p>
<p>Sometimes these bugs stay invisible for years.<br />Then one day, under a rare condition, they trigger and cause widespread issues.</p>
<p>Even worse, software failures can create chain reactions.</p>
<p>One service slows down → Another waits → Queues build up → Timeouts increase → More services fail.</p>
<p>And suddenly, a small issue becomes a major outage.</p>
<p>This is why strong system design, testing, and monitoring are essential.<br />Not to remove all bugs, but to catch them early and contain the damage.</p>
<h2 id="heading-human-mistakes-cause-the-most-outages">Human Mistakes Cause the Most Outages</h2>
<p>Surprisingly, the biggest cause of system failures isn’t hardware or software.</p>
<p>It’s people.</p>
<p>A wrong configuration, a mistaken deployment, or a command executed in the wrong environment.</p>
<p>These small errors can bring down large systems.</p>
<p>Good teams don’t try to eliminate human mistakes completely.<br />Instead, they build systems that are safer to operate.</p>
<p>That means:</p>
<ul>
<li><p>Safe testing environments</p>
</li>
<li><p>Easy rollback options</p>
</li>
<li><p>Gradual deployments</p>
</li>
<li><p>Clear monitoring</p>
</li>
</ul>
<p>So when something goes wrong, recovery is fast.</p>
<h2 id="heading-reliability-is-a-responsibility">Reliability Is a Responsibility</h2>
<p>It’s easy to think reliability only matters for “critical” systems.</p>
<p>But even simple applications carry responsibility.</p>
<p>If a system loses financial data, it causes stress.<br />If it loses business records, it causes damage.<br />If it loses personal memories, it causes emotional loss.</p>
<p>Even if an app isn’t life-critical, reliability still matters deeply to the people using it.</p>
<h2 id="heading-the-reality-of-trade-offs">The Reality of Trade-Offs</h2>
<p>Not every system can be extremely reliable from day one.</p>
<p>Startups, prototypes, and early-stage products often focus more on speed than perfection.<br />And that’s okay, but it should always be a conscious decision. Because improving reliability later becomes harder if it wasn’t considered early.</p>
<h2 id="heading-a-simple-way-to-remember-reliability">A Simple Way to Remember Reliability</h2>
<p>At its core, reliability is about one thing:</p>
<p>When something breaks, does the system still work?</p>
<p>Reliable systems:</p>
<ul>
<li><p>Expect faults</p>
</li>
<li><p>Absorb failures</p>
</li>
<li><p>Recover quickly</p>
</li>
<li><p>Protect user trust</p>
</li>
</ul>
<p>And that’s what separates fragile systems from strong ones.</p>
<hr />
<p>If you found this useful, I write simple, practical blogs on backend systems, databases, and system design.<br />Follow along to catch the next post in this series - we’ll explore <strong>Scalability</strong> next.</p>
]]></content:encoded></item><item><title><![CDATA[Exploring REST: Beyond Basic HTTP APIs Explained]]></title><description><![CDATA[When people hear REST, they often think it simply means “an HTTP endpoint that returns JSON.”
But REST is much more than that. It’s a way of designing interactions between clients and servers, not a library or a framework.
Let’s break REST down in a ...]]></description><link>https://blogs.sumanprasad.in/exploring-rest-beyond-basic-http-apis-explained</link><guid isPermaLink="true">https://blogs.sumanprasad.in/exploring-rest-beyond-basic-http-apis-explained</guid><category><![CDATA[REST API]]></category><category><![CDATA[backend]]></category><category><![CDATA[Web Development]]></category><category><![CDATA[System Design]]></category><category><![CDATA[software development]]></category><dc:creator><![CDATA[Suman Prasad]]></dc:creator><pubDate>Wed, 04 Feb 2026 05:46:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1770183241162/39c96b82-6601-4815-a8c0-0cc84088a0f2.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When people hear REST, they often think it simply means “an HTTP endpoint that returns JSON.”</p>
<p>But REST is <strong>much more than that</strong>. It’s a way of <strong>designing interactions between clients and servers</strong>, not a library or a framework.</p>
<p>Let’s break REST down in a simple and practical way.</p>
<h2 id="heading-what-rest-actually-is">What REST Actually Is</h2>
<p>REST stands for <strong>Representational State Transfer</strong>.</p>
<p>It is a set of <strong>architectural principles</strong> that describes how a client and a server should communicate.</p>
<p>REST:</p>
<ul>
<li><p>Does not force you to use a specific language</p>
</li>
<li><p>Does not enforce a framework</p>
</li>
<li><p>Does not dictate how data is stored internally</p>
</li>
</ul>
<p>It only defines <strong>how resources are identified, accessed, and represented</strong>.</p>
<h2 id="heading-everything-is-a-resource">Everything Is a Resource</h2>
<p>At the heart of REST is the idea of a resource.</p>
<p>A resource is any meaningful object in your system, such as:</p>
<ul>
<li><p>A user</p>
</li>
<li><p>An order</p>
</li>
<li><p>A product</p>
</li>
<li><p>A comment</p>
</li>
</ul>
<p>Each resource is identified using a <strong>unique identifier</strong>, commonly a URL when REST is implemented over HTTP.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770182691628/1cd78f0a-176e-44d5-928a-1a821fbc9b01.png" alt class="image--center mx-auto" /></p>
<p>For example:</p>
<pre><code class="lang-bash">/users/42
/orders/105
/products/9
</code></pre>
<p>These URLs represent <strong>things</strong>, not actions.</p>
<h2 id="heading-actions-are-separate-from-resources">Actions Are Separate From Resources</h2>
<p>In REST, actions are not part of the URL.</p>
<p>Instead, the action is <strong>expressed using the operation applied to the resource</strong>.</p>
<p>Think in terms of:</p>
<ul>
<li><p>Fetching a resource</p>
</li>
<li><p>Creating a resource</p>
</li>
<li><p>Updating a resource</p>
</li>
<li><p>Removing a resource</p>
</li>
</ul>
<p>This separation makes APIs predictable and easy to reason about.</p>
<h2 id="heading-what-is-representation">What is Representation?</h2>
<p>A resource itself is abstract.</p>
<p>What the client receives is a representation of that resource.</p>
<p>The same resource can be represented in different formats:</p>
<ul>
<li><p>JSON</p>
</li>
<li><p>XML</p>
</li>
<li><p>CSV</p>
</li>
</ul>
<p>The client can request the format it understands, and the server responds if it supports it.</p>
<p>This allows REST APIs to serve:</p>
<ul>
<li><p>Web apps</p>
</li>
<li><p>Mobile apps</p>
</li>
<li><p>Other backend services</p>
</li>
</ul>
<p>without changing the core resource model.</p>
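<p>This separation of resource and representation can be sketched as a tiny content-negotiation function. The formats chosen here are illustrative; the point is that the resource stays the same while its representation changes.</p>
<pre><code class="language-python">import csv
import io
import json

def represent(resource: dict, accept: str) -> str:
    """Render the same resource in the format the client asked for."""
    if accept == "application/json":
        return json.dumps(resource)
    if accept == "text/csv":
        buf = io.StringIO()
        writer = csv.writer(buf)
        writer.writerow(resource.keys())
        writer.writerow(resource.values())
        return buf.getvalue().strip()
    raise ValueError("unsupported representation: " + accept)

user = {"id": 42, "name": "Ada"}
# represent(user, "application/json") and represent(user, "text/csv")
# describe the same resource, just in different formats.
</code></pre>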
<h2 id="heading-rest-is-not-tied-to-http">REST is Not Tied to HTTP</h2>
<p>One important thing many people miss:</p>
<p><strong>REST is not bound to HTTP</strong>.</p>
<p>REST only cares that:</p>
<ul>
<li><p>Resources are clearly identified</p>
</li>
<li><p>Actions are well-defined</p>
</li>
<li><p>Representations are transferred between the client and server</p>
</li>
</ul>
<p>In theory, REST can work over:</p>
<ul>
<li><p>HTTP</p>
</li>
<li><p>Messaging Systems</p>
</li>
<li><p>Even non-network interfaces</p>
</li>
</ul>
<p>However, in practice, REST fits extremely well with HTTP.</p>
<h2 id="heading-why-rest-works-so-well-with-http">Why REST Works So Well With HTTP</h2>
<p>HTTP already provides everything REST needs:</p>
<ul>
<li><p>Clear operations</p>
</li>
<li><p>Resource addressing</p>
</li>
<li><p>Status reporting</p>
</li>
</ul>
<p>This natural alignment is why REST over HTTP became so popular.</p>
<p>Example</p>
<pre><code class="lang-bash">GET /students/1
</code></pre>
<p>This means:</p>
<ul>
<li><p>/students/1 → identifies the resource</p>
</li>
<li><p>GET → specifies the action</p>
</li>
</ul>
<p>The client asks for the <strong>current state</strong> of the resource, and the server responds with a representation.</p>
<h2 id="heading-why-rest-over-http-is-widely-used">Why REST Over HTTP Is Widely Used</h2>
<p>One major reason REST over HTTP dominates is <strong>tooling</strong>.</p>
<p>You get a lot for free:</p>
<ul>
<li><p>Easy testing with tools like curl or Postman</p>
</li>
<li><p>Built-in caching via proxies and CDNs</p>
</li>
<li><p>Load balancing at the network layer</p>
</li>
<li><p>Monitoring and tracing support</p>
</li>
<li><p>Transport-level security using HTTPS</p>
</li>
</ul>
<p>These existing tools reduce the effort required to build and operate APIs at scale.</p>
<h2 id="heading-common-downsides-of-rest-over-http">Common Downsides of REST Over HTTP</h2>
<p>REST over HTTP is powerful, but it’s not perfect.</p>
<p>Some real-world limitations include:</p>
<ul>
<li><p>Extra overhead from text-based payloads</p>
</li>
<li><p>Repeated serialization and deserialization</p>
</li>
<li><p>Verb limitations in certain environments</p>
</li>
<li><p>Inefficiency for chatty or streaming workloads</p>
</li>
<li><p>Tight coupling to HTTP semantics</p>
</li>
</ul>
<p>Because of these trade-offs, REST is not always the best choice for every use case.</p>
<h2 id="heading-when-rest-is-a-good-fit">When REST Is a Good Fit</h2>
<p>REST works very well when:</p>
<ul>
<li><p>You are exposing public APIs</p>
</li>
<li><p>Clients are diverse (web, mobile, services)</p>
</li>
<li><p>Caching is important</p>
</li>
<li><p>Simplicity and readability matter</p>
</li>
<li><p>Requests are stateless</p>
</li>
</ul>
<p>This is why REST remains dominant for most web-facing systems.</p>
<h2 id="heading-final-thoughts">Final Thoughts</h2>
<p>REST is not about exposing endpoints — It’s about <strong>modeling systems around resources and representations</strong>.</p>
<p>When used correctly, REST leads to APIs that are:</p>
<ul>
<li><p>Easy to understand</p>
</li>
<li><p>Easy to consume</p>
</li>
<li><p>Easy to scale</p>
</li>
</ul>
<p>But like any architectural style, REST is a tool — not a rule.</p>
<p>Choosing it should be based on system needs, not trends.</p>
<hr />
<p>If you enjoyed this, I write simple blogs on backend systems, databases, and system design.<br />You can follow me here to catch the next one.</p>
]]></content:encoded></item><item><title><![CDATA[Difference between Sharding and Partitioning]]></title><description><![CDATA[Sharding vs Partitioning: What’s the Real Difference?
As applications grow, databases often become the first bottleneck. Queries slow down, writes queue up, and suddenly the system that worked fine yesterday starts struggling today.
Two common techni...]]></description><link>https://blogs.sumanprasad.in/difference-between-sharding-and-partitioning</link><guid isPermaLink="true">https://blogs.sumanprasad.in/difference-between-sharding-and-partitioning</guid><category><![CDATA[backend]]></category><category><![CDATA[System Design]]></category><category><![CDATA[Databases]]></category><category><![CDATA[architecture]]></category><category><![CDATA[scalability]]></category><dc:creator><![CDATA[Suman Prasad]]></dc:creator><pubDate>Sat, 31 Jan 2026 06:01:26 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1769838673431/ca5097f7-5e9d-4b59-b959-4a009c0d4818.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2 id="heading-sharding-vs-partitioning-whats-the-real-difference">Sharding vs Partitioning: What’s the Real Difference?</h2>
<p>As applications grow, databases often become the first bottleneck. Queries slow down, writes queue up, and suddenly the system that worked fine yesterday starts struggling today.</p>
<p>Two common techniques used to scale databases are <strong>Partitioning</strong> and <strong>Sharding</strong>.</p>
<p>They sound similar, are often used together, and are frequently confused — but they solve slightly different problems.</p>
<p>Let’s break them down in a simple, practical way.</p>
<h2 id="heading-why-do-databases-need-to-scale">Why Do Databases Need to Scale?</h2>
<p>A database usually starts its life on a single machine. That machine has limited CPU, memory, disk, and network capacity.</p>
<p>As usage increases, the database experiences:</p>
<ul>
<li><p>More write traffic</p>
</li>
<li><p>More read traffic</p>
</li>
<li><p>More stored data</p>
</li>
</ul>
<p>At first, we try <strong>vertical scaling</strong> — upgrading the machine. But hardware has limits. When one machine can no longer handle the load, we need a different approach.</p>
<p>That’s where <strong>horizontal scaling</strong> enters the picture.</p>
<h2 id="heading-horizontal-scaling-in-databases">Horizontal Scaling in Databases</h2>
<p>Horizontal scaling means <strong>distributing data across multiple database servers</strong> so that no single machine becomes a bottleneck.</p>
<p>Instead of one database handling everything, multiple databases share the load.</p>
<p>This is the foundation on which both <strong>partitioning</strong> and <strong>sharding</strong> are built.</p>
<h2 id="heading-what-is-partitioning">What Is Partitioning?</h2>
<p><strong>Partitioning</strong> is about <strong>splitting data into smaller logical pieces</strong>.</p>
<p>All partitions may still live on:</p>
<ul>
<li><p>The same database server, or</p>
</li>
<li><p>Different servers</p>
</li>
</ul>
<p>But conceptually, the data is divided.</p>
<p>Example: Table Partitioning</p>
<p>Imagine a <code>users</code> table with millions of rows. Instead of storing everything together, the database can split it like:</p>
<ul>
<li><p>Users with IDs 1 to 1M</p>
</li>
<li><p>Users with IDs 1M+1 to 2M</p>
</li>
<li><p>Users with IDs 2M+1 to 3M</p>
</li>
</ul>
<p>Each chunk is a <strong>partition</strong>.</p>
<p>The database knows where each partition lives and routes queries accordingly.</p>
<h3 id="heading-key-points-about-partitioning">Key Points About Partitioning</h3>
<ul>
<li><p>It is mainly a <strong>data organization technique</strong></p>
</li>
<li><p>Often managed by the <strong>database engine</strong></p>
</li>
<li><p>Improves query performance and manageability</p>
</li>
<li><p>Does not always imply multiple machines</p>
</li>
</ul>
<h2 id="heading-what-is-sharding">What Is Sharding?</h2>
<p><strong>Sharding</strong> is about <strong>distributing data across multiple database servers</strong>.</p>
<p>Each server stores <strong>only a subset of the total data</strong> and handles queries for that subset.</p>
<p>That server is called a <strong>shard</strong>.</p>
<p>Example: User-Based Sharding</p>
<p>Suppose you have:</p>
<ul>
<li><p>Shard A → users with IDs ending in 0–4</p>
</li>
<li><p>Shard B → users with IDs ending in 5–9</p>
</li>
</ul>
<p>Each shard:</p>
<ul>
<li><p>Stores different data</p>
</li>
<li><p>Handles its own reads and writes</p>
</li>
<li><p>Scales independently</p>
</li>
</ul>
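<p>The routing logic for this example fits in a few lines. The last-digit rule and server names are just the illustration used here; real systems often hash the key instead.</p>
<pre><code class="language-python">SHARDS = {"A": "db-server-1", "B": "db-server-2"}  # hypothetical hosts

def shard_for_user(user_id: int) -> str:
    """Route by the last digit of the ID: 0-4 to shard A, 5-9 to shard B."""
    return "A" if user_id % 10 in range(5) else "B"

def connection_for_user(user_id: int) -> str:
    # The application (or a middleware layer) picks the right server
    return SHARDS[shard_for_user(user_id)]
</code></pre>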
<h3 id="heading-key-points-about-sharding">Key Points About Sharding</h3>
<ul>
<li><p>Sharding is an <strong>architectural decision</strong></p>
</li>
<li><p>Each shard is usually a <strong>separate database instance</strong></p>
</li>
<li><p>Enables true horizontal scaling</p>
</li>
<li><p>Requires routing logic in the application or middleware</p>
</li>
</ul>
<h2 id="heading-how-sharding-and-partitioning-work-together">How Sharding and Partitioning Work Together</h2>
<p>A common real-world setup:</p>
<ul>
<li><p>The database is <strong>sharded across machines</strong></p>
</li>
<li><p>Each shard internally <strong>uses partitions</strong> to manage its data</p>
</li>
</ul>
<p>For example:</p>
<ul>
<li><p>3 shards (3 database servers)</p>
</li>
<li><p>Each shard has 4 partitions</p>
</li>
</ul>
<p>So the system has:</p>
<ul>
<li><p><strong>3 shards</strong></p>
</li>
<li><p><strong>12 partitions total</strong></p>
</li>
</ul>
<h2 id="heading-advantages-of-sharding">Advantages of Sharding</h2>
<p>Sharding unlocks capabilities that a single database cannot provide:</p>
<ul>
<li><p>Handles very high read and write traffic</p>
</li>
<li><p>Increases total storage capacity</p>
</li>
<li><p>Improves fault isolation</p>
</li>
<li><p>Enables independent scaling per shard</p>
</li>
</ul>
<h2 id="heading-challenges-of-sharding">Challenges of Sharding</h2>
<p>Sharding comes with trade-offs:</p>
<ul>
<li><p>Operational complexity increases</p>
</li>
<li><p>Cross-shard queries are expensive</p>
</li>
<li><p>Transactions across shards are harder</p>
</li>
<li><p>Rebalancing shards is non-trivial</p>
</li>
</ul>
<p>This is why sharding is usually adopted <strong>only when necessary</strong>.</p>
<h2 id="heading-when-should-you-use-what">When Should You Use What?</h2>
<p><strong>When to use Partitioning</strong></p>
<ul>
<li><p>Tables are large</p>
</li>
<li><p>Queries need optimization</p>
</li>
<li><p>You want better data organization</p>
</li>
</ul>
<p><strong>When to use Sharding</strong></p>
<ul>
<li><p>One database cannot handle the load</p>
</li>
<li><p>You need horizontal scalability</p>
</li>
<li><p>The system has reached hardware limits</p>
</li>
</ul>
<h2 id="heading-final-thoughts">Final Thoughts</h2>
<p>Partitioning helps databases stay efficient.<br />Sharding helps systems grow beyond a single machine.</p>
<p>Most scalable systems use <strong>both</strong>, but only after carefully understanding the trade-offs.</p>
]]></content:encoded></item><item><title><![CDATA[Microservices]]></title><description><![CDATA[Microservices are everywhere today. Almost every modern system design discussion eventually reaches the question:
“Should we move to microservices?”
Before answering that, it’s important to understand what microservices really are, how they differ fr...]]></description><link>https://blogs.sumanprasad.in/microservices</link><guid isPermaLink="true">https://blogs.sumanprasad.in/microservices</guid><category><![CDATA[Microservices]]></category><category><![CDATA[System Design]]></category><category><![CDATA[backend]]></category><category><![CDATA[software development]]></category><category><![CDATA[distributed system]]></category><dc:creator><![CDATA[Suman Prasad]]></dc:creator><pubDate>Fri, 30 Jan 2026 05:07:02 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1769749236372/b83ecad3-7a17-4d6d-8792-8471289afcf9.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Microservices are everywhere today. Almost every modern system design discussion eventually reaches the question:</p>
<p>“Should we move to microservices?”</p>
<p>Before answering that, it’s important to understand what microservices really are, how they differ from monoliths, and when they actually make sense.</p>
<h2 id="heading-what-are-microservices">What Are Microservices?</h2>
<p>In simple terms, <strong>microservices are small, independent services that focus on one business capability and communicate over a network</strong>.</p>
<p>Each Service:</p>
<ul>
<li><p>Has a clear responsibility</p>
</li>
<li><p>Can be developed, deployed, and scaled independently</p>
</li>
<li><p>Exposes functionality via APIs</p>
</li>
</ul>
<p>For example, in an e-commerce platform:</p>
<ul>
<li><p>One service handles Orders</p>
</li>
<li><p>One handles Payments</p>
</li>
<li><p>One handles Notifications</p>
</li>
<li><p>One handles Analytics</p>
</li>
</ul>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769747510449/5f3cb83d-9a0d-4f4d-83fb-a240b88a9c6f.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-what-is-a-monolith">What Is a Monolith?</h2>
<p>A monolith is a single application where all features live in one <strong>codebase</strong> and are deployed together.</p>
<p>In a monolithic system:</p>
<ul>
<li><p>Payment logic</p>
</li>
<li><p>Notification logic</p>
</li>
<li><p>User management</p>
</li>
<li><p>Analytics</p>
</li>
</ul>
<p>are all part of the same application and run as one unit.</p>
<p>This is how <strong>most products start</strong>, and that’s not a bad thing.</p>
<p><img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1769747943266/17af95ae-ec6b-4a7f-ae6f-d6574694388f.png" alt class="image--center mx-auto" /></p>
<h2 id="heading-why-monoliths-are-a-good-starting-point">Why Monoliths Are a Good Starting Point</h2>
<p>Monoliths are often underestimated. They are actually great for early-stage systems.</p>
<p>Advantages of a monolith:</p>
<ul>
<li><p>Easy to build and understand</p>
</li>
<li><p>Simple testing and debugging</p>
</li>
<li><p>One deployment pipeline</p>
</li>
<li><p>Faster initial development</p>
</li>
<li><p>Easier local setup for developers</p>
</li>
</ul>
<p>For a small team or a new product, a monolith helps move fast.</p>
<h2 id="heading-problems-with-large-monoliths">Problems With Large Monoliths</h2>
<p>As the system grows, monoliths start showing cracks.</p>
<p>Common issues:</p>
<ul>
<li><p>Code becomes tightly coupled</p>
</li>
<li><p>A small change requires redeploying the whole system</p>
</li>
<li><p>A bug in one module can affect everything</p>
</li>
<li><p>Scaling one feature means scaling the entire application</p>
</li>
<li><p>Large codebases slow down development</p>
</li>
</ul>
<p>At this stage, teams start thinking about microservices.</p>
<h2 id="heading-moving-from-monolith-to-microservices">Moving From Monolith to Microservices</h2>
<p>Migrating to microservices is <strong>not a one-shot rewrite</strong>.</p>
<p>It is a gradual process.</p>
<p>A common approach:</p>
<ul>
<li><p>Identify a well-defined business area (e.g., Payments)</p>
</li>
<li><p>Extract it into a separate service</p>
</li>
<li><p>Expose it via an API</p>
</li>
<li><p>Repeat for other parts over time</p>
</li>
</ul>
<p>This way, the monolith slowly shrinks while services grow.</p>
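<p>This incremental extraction is often fronted by a routing layer (sometimes called the strangler fig pattern). Here is a minimal Python sketch of the idea, with hypothetical handler names: requests for extracted capabilities go to the new service, while everything else still reaches the monolith.</p>
<pre><code class="lang-python">def make_router(monolith_handler, extracted_services):
    # extracted_services maps a capability name to its new service handler.
    def route(capability, request):
        handler = extracted_services.get(capability, monolith_handler)
        return handler(capability, request)
    return route

# Hypothetical handlers: Payments has been extracted, Orders has not.
monolith = lambda capability, request: ("monolith", capability)
payments_service = lambda capability, request: ("payments-service", capability)

route = make_router(monolith, {"payments": payments_service})
</code></pre>
<p>As more capabilities are extracted, new entries are added to the map, and the monolith handles less and less traffic.</p>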
<h2 id="heading-key-characteristics-of-microservices">Key Characteristics of Microservices</h2>
<p>Well-designed microservices share some common traits:</p>
<ul>
<li><p><strong>Autonomous:</strong> Each service can be developed and deployed independently.</p>
</li>
<li><p><strong>Business-focused:</strong> Services are designed around business needs, not technical layers.</p>
</li>
<li><p><strong>Loosely coupled:</strong> Services communicate through APIs, not shared databases.</p>
</li>
<li><p><strong>Independently scalable:</strong> Heavy-load services can be scaled without touching others.</p>
</li>
</ul>
<h2 id="heading-advantages-of-microservices">Advantages of Microservices</h2>
<ul>
<li><p><strong>Faster development:</strong> Small teams can work independently.</p>
</li>
<li><p><strong>Better scalability:</strong> Only the required service is scaled.</p>
</li>
<li><p><strong>Technology flexibility:</strong> Each service can use the most suitable tech stack.</p>
</li>
<li><p><strong>Fault isolation:</strong> A failing service can be isolated using patterns like circuit breakers.</p>
</li>
<li><p><strong>Reusability:</strong> Services can be reused across different applications.</p>
</li>
</ul>
<h2 id="heading-when-do-microservices-make-sense">When Do Microservices Make Sense?</h2>
<p>Microservices are a good fit when:</p>
<ul>
<li><p>The system is large and growing</p>
</li>
<li><p>Teams are becoming bottlenecks</p>
</li>
<li><p>Different parts scale very differently</p>
</li>
<li><p>Independent deployments are required</p>
</li>
<li><p>System reliability is critical</p>
</li>
</ul>
<h2 id="heading-final-thoughts">Final Thoughts</h2>
<p>Microservices are about structuring systems around business capabilities, not just splitting code into smaller pieces.</p>
<p>Starting with a monolith and evolving into microservices is often the most practical path. The goal is not to follow trends, but to build systems that are maintainable, scalable, and reliable.</p>
<p>Microservices are not about adding complexity; they are about managing complexity correctly.</p>
]]></content:encoded></item><item><title><![CDATA[Decoding ACID Properties]]></title><description><![CDATA[Databases are used in systems where correctness really matters, like payment, bookings, inventory, user data, and more.
To make sure data stays reliable even under failures and heavy concurrency, databases follow a set of guarantees known as ACID:

A...]]></description><link>https://blogs.sumanprasad.in/decoding-acid-properties</link><guid isPermaLink="true">https://blogs.sumanprasad.in/decoding-acid-properties</guid><category><![CDATA[Databases]]></category><category><![CDATA[SQL]]></category><category><![CDATA[System Design]]></category><category><![CDATA[backend]]></category><category><![CDATA[Computer Science]]></category><dc:creator><![CDATA[Suman Prasad]]></dc:creator><pubDate>Mon, 26 Jan 2026 17:25:50 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1769448156901/04626c0a-2cb4-4a49-801d-4410b1fe1d05.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Databases are used in systems where correctness really matters, like payment, bookings, inventory, user data, and more.</p>
<p>To make sure data stays reliable even under failures and heavy concurrency, databases follow a set of guarantees known as ACID:</p>
<ul>
<li><p>Atomicity</p>
</li>
<li><p>Consistency</p>
</li>
<li><p>Isolation</p>
</li>
<li><p>Durability</p>
</li>
</ul>
<p>Let’s understand each of them using <strong>realistic but simple examples</strong>, starting with what goes wrong when they are missing.</p>
<h2 id="heading-atomicity">Atomicity</h2>
<h3 id="heading-what-does-atomicity-mean">What does atomicity mean?</h3>
<p>Atomicity ensures that a transaction is treated as a single, indivisible unit.</p>
<p>Either all its operations succeed, or none of them are applied.</p>
<h3 id="heading-what-is-the-problem-without-atomicity">What is the problem without Atomicity?</h3>
<p>Imagine an online wallet system.</p>
<p>A transaction does two things:</p>
<ol>
<li><p>Deduct Rs 500 from the user’s wallet</p>
</li>
<li><p>Add Rs 500 to the merchant’s wallet</p>
</li>
</ol>
<p>Now imagine that the user’s wallet is <strong>debited</strong> and the <strong>system crashes</strong> before the merchant is credited. The user loses money, and the merchant <strong>never receives</strong> it. This partial update leaves the data in an incorrect state.</p>
<h3 id="heading-correct-behavior-with-atomicity">Correct Behavior (With Atomicity)</h3>
<p>With atomicity, if both updates succeed, then the transaction commits, and if any step fails, then <strong>everything is rolled back.</strong> So either both wallets are updated, or no wallet is changed at all. This guarantees correctness even during <strong>crashes or errors.</strong></p>
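<p>The all-or-nothing behavior can be illustrated with a small Python sketch. This is only an in-memory simulation, not how a real database implements rollback: the transfer either applies both updates or restores the snapshot taken at the start.</p>
<pre><code class="lang-python">class InsufficientFunds(Exception):
    pass

def transfer(wallets, sender, receiver, amount):
    snapshot = dict(wallets)  # state to restore if anything fails
    try:
        if amount > wallets[sender]:
            raise InsufficientFunds()
        wallets[sender] -= amount
        # A crash here, without the rollback below, would lose the money.
        wallets[receiver] += amount
    except Exception:
        wallets.clear()
        wallets.update(snapshot)  # roll back: no partial update survives
        raise
</code></pre>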
<h2 id="heading-consistency">Consistency</h2>
<p>In simple terms: data must always follow the rules.</p>
<h3 id="heading-what-consistency-means">What Does Consistency Mean?</h3>
<p>Consistency ensures that <strong>database rules are never violated</strong>. Every successful transaction moves the database from one valid state to another valid state.</p>
<h3 id="heading-what-is-the-problem-without-consistency">What is the problem without Consistency?</h3>
<p>Consider a library system with the rule: “A book cannot be issued if available copies are zero.”</p>
<p>Now suppose that:</p>
<ul>
<li><p>Available copies = 0</p>
</li>
<li><p>A transaction still issues the book</p>
</li>
</ul>
<p>The result:</p>
<ul>
<li><p>Copies become -1</p>
</li>
<li><p>Data no longer makes sense</p>
</li>
</ul>
<h3 id="heading-correct-behavior-with-consistency">Correct Behavior (With Consistency)</h3>
<p>The database checks rules (constraints) before committing. If the rule is violated, the transaction is rejected. So either the book is issued correctly, or the transaction fails, and the data remains unchanged. The database <strong>never allows invalid data</strong>.</p>
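<p>A minimal Python sketch of the same idea (in a real database this would be a declared constraint, e.g. a CHECK on the copies column): the rule is validated before the change is applied, so an invalid state is never produced.</p>
<pre><code class="lang-python">class ConstraintViolation(Exception):
    pass

def issue_book(copies, book_id):
    # Rule: a book cannot be issued when no copies are available.
    if copies[book_id] == 0:
        raise ConstraintViolation("no copies available")
    copies[book_id] -= 1  # the count can never go below zero
</code></pre>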
<h2 id="heading-isolation">Isolation</h2>
<p>In simple terms: transactions run safely in parallel.</p>
<h3 id="heading-what-does-isolation-mean">What Does Isolation Mean?</h3>
<p>Isolation ensures that <strong>multiple transactions running at the same time do not interfere with each other</strong>.</p>
<p>Each transaction behaves as if it were running alone.</p>
<h3 id="heading-what-is-the-problem-without-isolation">What is the problem without Isolation?</h3>
<p>Consider a concert ticket system with 100 total seats, where two users try to book the <strong>last seat</strong> at the same time.</p>
<p>Without isolation, both transactions read “1 seat available” and both book successfully.</p>
<p>The result:</p>
<ul>
<li><p>101 tickets sold</p>
</li>
<li><p>System oversells</p>
</li>
</ul>
<h3 id="heading-correct-behavior-with-isolation">Correct Behavior (With Isolation)</h3>
<p>With isolation, the first transaction locks the seat and the second transaction waits; only one booking succeeds. So the other transaction either fails or sees updated data and stops. This prevents race conditions and data corruption.</p>
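<p>The race can be reproduced and fixed with Python threads. This is only an analogy for database locking: the shared lock plays the role of the row lock, so the check-and-book step runs as if each transaction were alone.</p>
<pre><code class="lang-python">import threading

seats = {"available": 1}
booked = []
lock = threading.Lock()

def book(user):
    with lock:  # only one "transaction" checks and books at a time
        if seats["available"] > 0:
            seats["available"] -= 1
            booked.append(user)

threads = [threading.Thread(target=book, args=(u,)) for u in ("alice", "bob")]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Exactly one booking succeeds; the system never oversells.
</code></pre>
<p>Without the lock, both threads could read <code>available == 1</code> before either decrements it, which is exactly the overselling scenario above.</p>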
<h2 id="heading-durability">Durability</h2>
<p>In simple terms: committed data survives crashes.</p>
<h3 id="heading-what-does-durability-mean">What Does Durability Mean?</h3>
<p>Durability guarantees that <strong>once a transaction is committed, its changes will not be lost</strong>, even if the system crashes immediately after.</p>
<h3 id="heading-what-is-the-problem-without-durability">What is the problem without Durability?</h3>
<p>Imagine placing an order on an e-commerce website.</p>
<ul>
<li><p>Payment succeeds</p>
</li>
<li><p>Order confirmation is shown</p>
</li>
<li><p>Server crashes before data is saved to disk</p>
</li>
</ul>
<p>After restart:</p>
<ul>
<li><p>Order is missing</p>
</li>
<li><p>Payment exists but order does not.</p>
</li>
</ul>
<p>This is not acceptable.</p>
<h3 id="heading-correct-behavior-with-durability">Correct Behavior (With Durability)</h3>
<p>With durability, changes are written to non-volatile storage (disk), and transaction logs are flushed before the commit is acknowledged. On restart, the database replays its logs and restores state. So even after a power failure, crash, or restart, the committed order still exists.</p>
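<p>The core mechanism, flushing a log record to disk before acknowledging the commit, can be sketched in Python with <code>os.fsync</code> (real databases use far more elaborate write-ahead logging, but the flush-before-acknowledge rule is the same):</p>
<pre><code class="lang-python">import json
import os

def commit(log_path, record):
    # Append the record and force it to stable storage
    # before the commit is acknowledged.
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
        f.flush()
        os.fsync(f.fileno())

def recover(log_path):
    # After a restart, replay the log to rebuild committed state.
    with open(log_path) as f:
        return [json.loads(line) for line in f]
</code></pre>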
<p>ACID properties are not independent; they support each other. Atomicity prevents partial updates, Consistency ensures rules are respected, Isolation protects concurrent execution, and Durability preserves committed data. Removing even one of them can lead to serious data issues.</p>
<h3 id="heading-final-thoughts">Final Thoughts</h3>
<p>ACID properties are not theoretical concepts.</p>
<p>They solve real problems that appear in everyday systems under load, failures, and concurrency.</p>
<p>Modern databases handle most of this automatically, but as engineers, understanding ACID helps us:</p>
<ul>
<li><p>Design better systems</p>
</li>
<li><p>Write safer transactions</p>
</li>
<li><p>Debug data issues confidently</p>
</li>
</ul>
<h2 id="heading-references">References</h2>
<ul>
<li><p><a target="_blank" href="https://www.bmc.com/blogs/acid-atomic-consistent-isolated-durable/">ACID Explained - BMC</a></p>
</li>
<li><p><a target="_blank" href="https://en.wikipedia.org/wiki/Consistency_(database_systems)">Consistency - Wikipedia</a></p>
</li>
<li><p><a target="_blank" href="https://www.ibm.com/docs/en/cics-ts/5.4.0?topic=processing-acid-properties-transactions">ACID properties of transactions</a></p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[Understanding Database Deadlocks and Their Resolution Methods]]></title><description><![CDATA[Database deadlocks are among the most challenging concurrency issues encountered in real-world production systems. While modern databases are designed to handle parallel workloads efficiently, deadlocks remain an unavoidable side effect of correct lo...]]></description><link>https://blogs.sumanprasad.in/understanding-database-deadlocks-and-their-resolution-methods</link><guid isPermaLink="true">https://blogs.sumanprasad.in/understanding-database-deadlocks-and-their-resolution-methods</guid><category><![CDATA[Databases]]></category><category><![CDATA[System Design]]></category><category><![CDATA[backend]]></category><category><![CDATA[software development]]></category><category><![CDATA[software]]></category><dc:creator><![CDATA[Suman Prasad]]></dc:creator><pubDate>Wed, 21 Jan 2026 05:15:56 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1768665343191/7e310a06-f563-4159-8bf3-2569a5a07c4e.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Database deadlocks are among the most challenging <strong>concurrency</strong> issues encountered in real-world production systems. While modern databases are designed to handle parallel workloads efficiently, deadlocks remain an unavoidable side effect of <strong>correct</strong> <strong>locking and isolation</strong>.</p>
<p>To build a resilient application, it’s crucial to understand how deadlocks form, how databases detect them, and how they are resolved.</p>
<p>This article breaks down database deadlocks from the ground up, covering causes, detection techniques, resolution strategies, and real-world database behaviors.</p>
<h2 id="heading-what-is-database-deadlock">What is Database Deadlock?</h2>
<p>A <strong>database deadlock</strong> occurs when <strong>two or more transactions block each other indefinitely</strong>, each waiting for locks held by the other, creating a circular dependency that prevents any of the transactions from proceeding.</p>
<h3 id="heading-how-deadlock-pattern-look-like">What Does a Deadlock Pattern Look Like?</h3>
<ul>
<li><p>Transaction <strong>A</strong> holds a lock on <strong>Resource X</strong> and waits for <strong>Resource Y</strong></p>
</li>
<li><p>Transaction <strong>B</strong> holds a lock on <strong>Resource Y</strong> and waits for <strong>Resource X</strong></p>
</li>
<li><p>A circular dependency forms, and progress stops completely</p>
</li>
</ul>
<p>Without intervention from the database engine, these transactions would wait forever.</p>
<p><strong>Example</strong></p>
<pre><code class="lang-sql"><span class="hljs-comment">-- Transaction 1 (T1)</span>
<span class="hljs-keyword">BEGIN</span>;
<span class="hljs-keyword">UPDATE</span> orders <span class="hljs-keyword">SET</span> <span class="hljs-keyword">status</span> = <span class="hljs-string">'CONFIRMED'</span> <span class="hljs-keyword">WHERE</span> order_id = <span class="hljs-number">101</span>;
<span class="hljs-keyword">UPDATE</span> inventory <span class="hljs-keyword">SET</span> quantity = quantity - <span class="hljs-number">1</span> <span class="hljs-keyword">WHERE</span> product_id = <span class="hljs-number">50</span>;
<span class="hljs-keyword">COMMIT</span>;

<span class="hljs-comment">-- Transaction 2 (T2)</span>
<span class="hljs-keyword">BEGIN</span>;
<span class="hljs-keyword">UPDATE</span> inventory <span class="hljs-keyword">SET</span> quantity = quantity - <span class="hljs-number">1</span> <span class="hljs-keyword">WHERE</span> product_id = <span class="hljs-number">50</span>;
<span class="hljs-keyword">UPDATE</span> orders <span class="hljs-keyword">SET</span> <span class="hljs-keyword">status</span> = <span class="hljs-string">'CONFIRMED'</span> <span class="hljs-keyword">WHERE</span> order_id = <span class="hljs-number">101</span>;
<span class="hljs-keyword">COMMIT</span>;
</code></pre>
<p>Transaction 1 (T1) locks orders (order_id = 101) and waits for inventory (product_id = 50), while Transaction 2 (T2) locks inventory (product_id = 50) and waits for orders (order_id = 101). If the two transactions interleave, each waits on a lock the other holds, and neither can proceed.</p>
<h2 id="heading-common-causes-of-deadlocks-in-practice">Common Causes of Deadlocks in Practice</h2>
<h3 id="heading-inconsistent-lock-ordering"><strong>Inconsistent Lock Ordering</strong></h3>
<p>When different transactions acquire locks on the same resources in different orders. Enforcing a consistent order (e.g., always lock Table A before Table B) is a primary prevention strategy.</p>
<p>Real-world Example: In a banking system, one service updates customer details and then logs the change, while another service logs the action first and then updates the customer.</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- Transaction A</span>
<span class="hljs-keyword">BEGIN</span>;
<span class="hljs-keyword">UPDATE</span> customers <span class="hljs-keyword">SET</span> address = <span class="hljs-string">'New Address'</span> <span class="hljs-keyword">WHERE</span> customer_id = <span class="hljs-number">101</span>;
<span class="hljs-keyword">UPDATE</span> audit_logs <span class="hljs-keyword">SET</span> <span class="hljs-keyword">action</span> = <span class="hljs-string">'ADDRESS_UPDATE'</span> <span class="hljs-keyword">WHERE</span> customer_id = <span class="hljs-number">101</span>;
<span class="hljs-keyword">COMMIT</span>;
</code></pre>
<pre><code class="lang-sql"><span class="hljs-comment">-- Transaction B</span>
<span class="hljs-keyword">BEGIN</span>;
<span class="hljs-keyword">UPDATE</span> audit_logs <span class="hljs-keyword">SET</span> reviewed = <span class="hljs-literal">true</span> <span class="hljs-keyword">WHERE</span> customer_id = <span class="hljs-number">101</span>;
<span class="hljs-keyword">UPDATE</span> customers <span class="hljs-keyword">SET</span> last_updated = <span class="hljs-keyword">NOW</span>() <span class="hljs-keyword">WHERE</span> customer_id = <span class="hljs-number">101</span>;
<span class="hljs-keyword">COMMIT</span>;
</code></pre>
<p>Transaction A locks customers, and Transaction B locks audit_logs. Each waits for the other, causing circular dependency.</p>
<h3 id="heading-long-running-transactions"><strong>Long-running Transactions</strong></h3>
<p>Transactions that hold locks for extended periods increase the probability of conflict with other transactions.</p>
<p>Real-world Example: In a reporting system, a transaction reads large datasets, performs heavy computation, and then updates a summary table (all within transactions).</p>
<pre><code class="lang-sql"><span class="hljs-keyword">BEGIN</span>;
<span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> sales <span class="hljs-keyword">WHERE</span> sale_date <span class="hljs-keyword">BETWEEN</span> <span class="hljs-string">'2024-01-01'</span> <span class="hljs-keyword">AND</span> <span class="hljs-string">'2024-12-31'</span>;
<span class="hljs-comment">-- Application processes data for several seconds</span>
<span class="hljs-keyword">UPDATE</span> yearly_summary <span class="hljs-keyword">SET</span> total_sales = <span class="hljs-number">500000</span> <span class="hljs-keyword">WHERE</span> <span class="hljs-keyword">year</span> = <span class="hljs-number">2024</span>;
<span class="hljs-keyword">COMMIT</span>;
</code></pre>
<p>Locks remain held during long processing, and other transactions block and form waiting chains. Deadlock probability increases under concurrency.</p>
<h3 id="heading-lock-escalation"><strong>Lock Escalation</strong></h3>
<p>Databases may automatically convert many fine-grained locks (like row-level) into fewer coarse-grained locks (like table-level) for performance efficiency, which can unexpectedly block other transactions and create deadlocks.</p>
<p>Real-world Example: In a warehouse management system, bulk updates on inventory rows cause the database to escalate row locks into a table lock.</p>
<pre><code class="lang-sql"><span class="hljs-keyword">BEGIN</span>;
<span class="hljs-keyword">UPDATE</span> inventory
<span class="hljs-keyword">SET</span> last_checked = <span class="hljs-keyword">NOW</span>()
<span class="hljs-keyword">WHERE</span> warehouse_id = <span class="hljs-number">5</span>;
<span class="hljs-keyword">COMMIT</span>;
</code></pre>
<p>Internally, many row-level locks are acquired, the database escalates them to a single table-level lock, and other transactions attempting row updates are blocked.</p>
<p>Concurrent Transaction</p>
<pre><code class="lang-sql"><span class="hljs-keyword">BEGIN</span>;
<span class="hljs-keyword">UPDATE</span> inventory <span class="hljs-keyword">SET</span> quantity = quantity - <span class="hljs-number">1</span> <span class="hljs-keyword">WHERE</span> product_id = <span class="hljs-number">900</span>;
<span class="hljs-keyword">COMMIT</span>;
</code></pre>
<p>Unexpected blocking and circular waits may form.</p>
<h3 id="heading-poorly-optimized-queries"><strong>Poorly Optimized Queries</strong></h3>
<p>Inefficient queries that perform large table or index scans can acquire locks, holding them for longer than necessary and increasing contention.</p>
<p>Real-world Example: In a customer support system, missing indexes cause full table scans during updates, locking more rows than necessary.</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- Problematic code</span>
<span class="hljs-keyword">BEGIN</span>;
<span class="hljs-keyword">UPDATE</span> tickets
<span class="hljs-keyword">SET</span> <span class="hljs-keyword">status</span> = <span class="hljs-string">'CLOSED'</span>
<span class="hljs-keyword">WHERE</span> created_at &lt; <span class="hljs-string">'2023-01-01'</span>;
<span class="hljs-keyword">COMMIT</span>;
</code></pre>
<p>This statement performs a full table or index scan and acquires a large number of locks. Locks held longer than needed increase contention and lead to deadlocks.</p>
<p>Optimized Version</p>
<pre><code class="lang-sql"><span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">INDEX</span> idx_tickets_created_at <span class="hljs-keyword">ON</span> tickets(created_at);

<span class="hljs-keyword">BEGIN</span>;
<span class="hljs-keyword">UPDATE</span> tickets
<span class="hljs-keyword">SET</span> <span class="hljs-keyword">status</span> = <span class="hljs-string">'CLOSED'</span>
<span class="hljs-keyword">WHERE</span> created_at &lt; <span class="hljs-string">'2023-01-01'</span>;
<span class="hljs-keyword">COMMIT</span>;
</code></pre>
<h3 id="heading-foreign-key-constraints"><strong>Foreign Key Constraints</strong></h3>
<p>Actions on a parent table might require the database to internally check or lock related child table records, creating hidden dependencies and potential lock chains.</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- Transaction A</span>
<span class="hljs-keyword">BEGIN</span>;
<span class="hljs-keyword">DELETE</span> <span class="hljs-keyword">FROM</span> documents <span class="hljs-keyword">WHERE</span> doc_id = <span class="hljs-number">200</span>;
<span class="hljs-keyword">COMMIT</span>;
</code></pre>
<pre><code class="lang-sql"><span class="hljs-comment">-- Transaction B</span>
<span class="hljs-keyword">BEGIN</span>;
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> permissions (doc_id, user_id, <span class="hljs-keyword">role</span>)
<span class="hljs-keyword">VALUES</span> (<span class="hljs-number">200</span>, <span class="hljs-number">10</span>, <span class="hljs-string">'EDITOR'</span>);
<span class="hljs-keyword">COMMIT</span>;
</code></pre>
<p>Deleting a document requires checking related permissions through foreign key constraints, which introduce implicit locks and create lock dependencies that remain invisible in application code.</p>
<h2 id="heading-deadlock-detection">Deadlock Detection</h2>
<p>The database detects deadlocks automatically, most commonly using the <strong>wait-for graph</strong> algorithm.</p>
<h3 id="heading-wait-for-graph-algorithm">Wait-for Graph Algorithm</h3>
<ul>
<li><p>A wait-for graph is a directed graph used by databases to model lock dependencies between transactions.</p>
</li>
<li><p>Each node represents an active transaction.</p>
</li>
<li><p>Each <strong>directed edge (T₁ → T₂)</strong> means <em>Transaction T₁ is waiting for a resource held by Transaction T₂</em>.</p>
</li>
</ul>
<p><strong>Q. Why do databases use Wait-for Graphs?</strong></p>
<p>Lock-based systems naturally create waiting relationships. Tracking these relationships visually makes <strong>deadlock detection efficient</strong>. A deadlock is present if and only if a <strong>cycle exists</strong> in the graph.</p>
<p><strong>Q. How is the Graph Built?</strong></p>
<ul>
<li><p>When a transaction requests a lock that cannot be granted:</p>
<ul>
<li>The database adds an edge from the waiting transaction to the holding transaction.</li>
</ul>
</li>
<li><p>The graph is <strong>dynamic</strong> and updates as locks are acquired or released.</p>
</li>
<li><p>Only <strong>blocked transactions</strong> participate in the graph.</p>
</li>
</ul>
<h3 id="heading-deadlock-detection-rule">Deadlock Detection Rule</h3>
<ul>
<li><p>The database regularly checks (usually every few seconds) all transactions currently waiting on locks.</p>
</li>
<li><p>It builds a wait-for graph showing which transaction blocks which other based on active resource requests and holdings.</p>
</li>
<li><p>Graph traversal algorithms then scan for cycles, declaring a deadlock when one is detected.</p>
</li>
</ul>
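<p>Cycle detection on a wait-for graph is ordinary graph traversal. A small Python sketch (a simplification of what database engines do internally) represents the graph as a mapping from each blocked transaction to the transactions it waits on, and reports a deadlock when a depth-first search finds a back edge:</p>
<pre><code class="lang-python">def has_deadlock(wait_for):
    # wait_for: {txn: set of txns it is waiting on}
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited / in progress / finished
    color = {}

    def visit(t):
        color[t] = GRAY
        for u in wait_for.get(t, ()):
            if color.get(u, WHITE) == GRAY:
                return True  # back edge: a circular wait exists
            if color.get(u, WHITE) == WHITE and visit(u):
                return True
        color[t] = BLACK
        return False

    return any(color.get(t, WHITE) == WHITE and visit(t) for t in wait_for)
</code></pre>
<p>For the two-transaction pattern described earlier, <code>has_deadlock({"T1": {"T2"}, "T2": {"T1"}})</code> reports a deadlock, while a simple waiting chain with no cycle does not.</p>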
<h3 id="heading-detection-frequency-by-database-system">Detection Frequency by Database System</h3>
<p>Different database systems use varying detection intervals to balance overhead with responsiveness.</p>
<ul>
<li><p><strong>PostgreSQL</strong>: Checks for deadlocks every 1 second by default after the <code>deadlock_timeout</code> period.</p>
</li>
<li><p><strong>MySQL (InnoDB)</strong>: Uses immediate detection for simple two-transaction deadlocks, but falls back to periodic checking every ~5 seconds for complex scenarios.</p>
</li>
<li><p><strong>SQL Server</strong>: Runs deadlock detection every 5 seconds by default, but can drop to as low as 100 milliseconds under high contention</p>
</li>
</ul>
<h2 id="heading-deadlock-resolution">Deadlock Resolution</h2>
<p>Once a deadlock is detected (for example, using the wait-for graph algorithm), the database must break the <strong>circular dependency.</strong></p>
<p>To do this, the database first selects a <strong>victim transaction</strong>. The choice is made carefully to minimize system impact. Typically, the database prefers terminating the transaction that has performed the least amount of work, as rolling it back requires fewer resources. In many systems, newer transactions are also favored as victims under the assumption that older transactions are closer to completion. Some databases additionally support transaction priorities, allowing lower-priority or background tasks to be aborted before critical operations.</p>
<p>Once the victim is chosen, the database <strong>rolls back the transaction</strong>, releasing all locks held by it. This immediately allows the remaining blocked transactions to continue execution. The rollback preserves atomicity and ensures the database remains in a consistent state.</p>
<p>From an application perspective, deadlocks are not exceptional failures but expected concurrency events. Applications should be designed to <strong>catch deadlock errors and retry the transaction</strong>, often with a small randomized backoff to avoid repeated collisions. In complex workflows, partial rollbacks using savepoints may also be used to limit lost work, although full rollbacks are more common during deadlock resolution.</p>
<p>While careful transaction design can reduce the likelihood of deadlock, most modern databases rely on <strong>detection and resolution</strong> rather than strict prevention, as eliminating deadlocks entirely is impractical in high-concurrency environments. The key guarantee provided by the database is that, after resolution, progress resumes safely without violating consistency or isolation.</p>
<h2 id="heading-deadlock-prevention-strategies">Deadlock Prevention Strategies</h2>
<p>Although deadlocks cannot be completely eliminated in concurrent systems, their frequency and impact can be significantly reduced through careful transaction design and system-level practices. One of the most effective techniques is enforcing a consistent lock ordering across the application. When all transactions acquire locks on shared resources in the same sequence, circular wait conditions are avoided entirely, making deadlocks structurally impossible for those code paths. For example, if an application always updates the users table before the orders table, deadlocks caused by reversed lock ordering are avoided:</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- Consistent ordering (users → orders)</span>
<span class="hljs-keyword">BEGIN</span>;
<span class="hljs-keyword">UPDATE</span> <span class="hljs-keyword">users</span> <span class="hljs-keyword">SET</span> last_login = <span class="hljs-keyword">NOW</span>() <span class="hljs-keyword">WHERE</span> user_id = <span class="hljs-number">10</span>;
<span class="hljs-keyword">UPDATE</span> orders <span class="hljs-keyword">SET</span> <span class="hljs-keyword">status</span> = <span class="hljs-string">'PROCESSED'</span> <span class="hljs-keyword">WHERE</span> order_id = <span class="hljs-number">500</span>;
<span class="hljs-keyword">COMMIT</span>;

<span class="hljs-comment">-- Problems arise only when different transactions reverse this order.</span>
</code></pre>
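<p>The same discipline can be enforced in application code by always acquiring locks in one fixed global order, for example sorted by resource name. A minimal Python sketch (an in-process analogy with hypothetical lock names, not database code):</p>
<pre><code class="lang-python">import threading

locks = {"orders": threading.Lock(), "users": threading.Lock()}

def run_transaction(needed, work):
    # Sorting gives every transaction the same acquisition order,
    # which makes a circular wait structurally impossible.
    ordered = sorted(needed)
    for name in ordered:
        locks[name].acquire()
    try:
        return work()
    finally:
        # Release in reverse order of acquisition.
        for name in reversed(ordered):
            locks[name].release()
</code></pre>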
<p>Another critical strategy is minimizing the scope and duration of transactions. Transactions that hold locks for long periods - especially while performing heavy computation, waiting for user input, or calling external services—dramatically increase contention. By keeping transactions short and limiting them strictly to database operations, locks are released quickly, reducing the chance of conflicts with other concurrent transactions.</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- Read and process outside the transaction</span>
<span class="hljs-keyword">SELECT</span> * <span class="hljs-keyword">FROM</span> reports <span class="hljs-keyword">WHERE</span> <span class="hljs-keyword">year</span> = <span class="hljs-number">2024</span>;

<span class="hljs-comment">-- Short write transaction</span>
<span class="hljs-keyword">BEGIN</span>;
<span class="hljs-keyword">UPDATE</span> report_summary <span class="hljs-keyword">SET</span> <span class="hljs-keyword">status</span> = <span class="hljs-string">'READY'</span> <span class="hljs-keyword">WHERE</span> <span class="hljs-keyword">year</span> = <span class="hljs-number">2024</span>;
<span class="hljs-keyword">COMMIT</span>;
</code></pre>
<p>Query performance also plays a major role in deadlock prevention. Poorly optimized queries that scan large portions of tables or indexes tend to acquire more locks and hold them longer than necessary. Proper indexing, selective queries, and efficient execution plans help reduce lock footprints and improve overall concurrency. Faster queries mean shorter lock lifetimes, which directly lowers deadlock probability.</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- Index to avoid full table scan</span>
<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">INDEX</span> idx_tickets_status <span class="hljs-keyword">ON</span> tickets(<span class="hljs-keyword">status</span>);

<span class="hljs-keyword">BEGIN</span>;
<span class="hljs-keyword">UPDATE</span> tickets <span class="hljs-keyword">SET</span> <span class="hljs-keyword">status</span> = <span class="hljs-string">'CLOSED'</span> <span class="hljs-keyword">WHERE</span> <span class="hljs-keyword">status</span> = <span class="hljs-string">'RESOLVED'</span>;
<span class="hljs-keyword">COMMIT</span>;
</code></pre>
<p>Understanding <strong>implicit locking behavior</strong>, particularly with foreign key constraints, is equally important. Operations on parent tables often require the database to internally lock related child records to maintain referential integrity. When these hidden dependencies are not accounted for, transactions may unintentionally acquire locks in conflicting orders. Designing transactions with awareness of these relationships helps prevent unexpected lock chains.</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- Parent table update</span>
<span class="hljs-keyword">BEGIN</span>;
<span class="hljs-keyword">UPDATE</span> documents <span class="hljs-keyword">SET</span> title = <span class="hljs-string">'Final Draft'</span> <span class="hljs-keyword">WHERE</span> doc_id = <span class="hljs-number">200</span>;
<span class="hljs-comment">-- Implicitly checks/locks permissions via FK</span>
<span class="hljs-keyword">COMMIT</span>;
</code></pre>
<p>Finally, applications should be built with the assumption that deadlocks can still occur under extreme concurrency. Implementing safe retry mechanisms with backoff ensures that when a deadlock does happen, it is handled transparently without user-facing errors. In practice, the most robust systems combine thoughtful transaction design with resilient retry logic, treating deadlocks as a manageable aspect of concurrency rather than a critical failure.</p>
<pre><code class="lang-python">import random
import time

# DeadlockError stands in for the driver-specific deadlock exception
# (the exact exception type and error code depend on the driver)
def execute_with_retry(txn_func, retries=3):
    for attempt in range(retries):
        try:
            return txn_func()
        except DeadlockError:
            if attempt == retries - 1:
                raise  # give up after the final attempt
            time.sleep(random.uniform(0.1, 0.5))  # randomized backoff
</code></pre>
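<p>The retry path can be exercised without a real database by simulating a transaction that is chosen as a deadlock victim once and then succeeds. Here <code>DeadlockError</code> is a hypothetical stand-in for the driver-specific exception, and the wrapper is repeated so the snippet runs standalone:</p>
<pre><code class="lang-python">import random
import time

class DeadlockError(Exception):
    """Stand-in for the driver-specific deadlock exception."""

def execute_with_retry(txn_func, retries=3):
    for attempt in range(retries):
        try:
            return txn_func()
        except DeadlockError:
            if attempt == retries - 1:
                raise
            time.sleep(random.uniform(0.01, 0.05))

attempts = []

def flaky_txn():
    # Fails with a simulated deadlock on the first call, then commits
    attempts.append(1)
    if len(attempts) == 1:
        raise DeadlockError("simulated deadlock victim")
    return "COMMITTED"

result = execute_with_retry(flaky_txn)
</code></pre>
<p>The first attempt raises, the wrapper backs off and retries, and the second attempt returns normally; the caller never sees the deadlock.</p>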
<h2 id="heading-how-specific-database-handle-deadlock">How Do Specific Databases Handle Deadlocks?</h2>
<p>Different database engines handle detection and resolution differently, based on their design goals and performance trade-offs. Understanding these differences is important when tuning systems or debugging production issues.</p>
<h3 id="heading-sql-server">SQL Server</h3>
<p>SQL Server uses a <strong>lock monitor thread</strong> that periodically scans for deadlocks by analyzing wait relationships between sessions. When a deadlock is found, SQL Server chooses a victim based on <strong>deadlock priority and estimated rollback cost</strong>.</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- Set deadlock priority</span>
<span class="hljs-keyword">SET</span> DEADLOCK_PRIORITY <span class="hljs-keyword">LOW</span>;
</code></pre>
<p>Example deadlock scenario:</p>
<pre><code class="lang-sql"><span class="hljs-keyword">BEGIN</span> TRAN;
<span class="hljs-keyword">UPDATE</span> employees <span class="hljs-keyword">SET</span> <span class="hljs-keyword">role</span> = <span class="hljs-string">'Senior'</span> <span class="hljs-keyword">WHERE</span> emp_id = <span class="hljs-number">77</span>;
<span class="hljs-comment">-- waits for payroll</span>
<span class="hljs-keyword">UPDATE</span> payroll <span class="hljs-keyword">SET</span> salary = salary + <span class="hljs-number">10000</span> <span class="hljs-keyword">WHERE</span> emp_id = <span class="hljs-number">77</span>;
<span class="hljs-keyword">COMMIT</span>;
</code></pre>
<p>When SQL Server resolves a deadlock, the victim transaction is rolled back with an error:</p>
<pre><code class="lang-sql">Transaction (Process ID xx) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.
</code></pre>
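<p>Error 1205 can also be caught and retried inside T-SQL itself. A minimal sketch of this pattern, reusing the tables from the example above:</p>
<pre><code class="lang-sql">-- Sketch: retry the work up to 3 times if chosen as deadlock victim
DECLARE @retries INT = 3;
WHILE @retries > 0
BEGIN
    BEGIN TRY
        BEGIN TRAN;
        UPDATE employees SET role = 'Senior' WHERE emp_id = 77;
        UPDATE payroll SET salary = salary + 10000 WHERE emp_id = 77;
        COMMIT;
        SET @retries = 0;  -- success: exit the loop
    END TRY
    BEGIN CATCH
        IF @@TRANCOUNT > 0 ROLLBACK;
        IF ERROR_NUMBER() = 1205 AND @retries > 1
            SET @retries = @retries - 1;  -- deadlock victim: retry
        ELSE
            THROW;  -- other error, or retries exhausted: re-raise
    END CATCH
END
</code></pre>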
<h3 id="heading-mysql">MySQL</h3>
<p>MySQL’s InnoDB engine takes a more <strong>aggressive approach</strong> to deadlock detection. For simple deadlock patterns, detection happens <strong>immediately</strong> when a lock request is made. For more complex cases, InnoDB falls back to timeout-based detection.</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- Enable deadlock detection and logging</span>
<span class="hljs-keyword">SET</span> <span class="hljs-keyword">GLOBAL</span> innodb_deadlock_detect = <span class="hljs-keyword">ON</span>; <span class="hljs-comment">-- Usually default, use only if disabled</span>
<span class="hljs-keyword">SET</span> <span class="hljs-keyword">GLOBAL</span> innodb_lock_wait_timeout = <span class="hljs-number">50</span>; <span class="hljs-comment">-- The default value, can be adjusted if needed</span>
<span class="hljs-keyword">SET</span> <span class="hljs-keyword">GLOBAL</span> innodb_print_all_deadlocks = <span class="hljs-keyword">ON</span>; <span class="hljs-comment">-- Recommended: log all deadlocks to the error log</span>
</code></pre>
<p>Example deadlock in MySQL:</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- Transaction 1</span>
<span class="hljs-keyword">START</span> <span class="hljs-keyword">TRANSACTION</span>;
<span class="hljs-keyword">UPDATE</span> wallets <span class="hljs-keyword">SET</span> balance = balance - <span class="hljs-number">500</span> <span class="hljs-keyword">WHERE</span> user_id = <span class="hljs-number">42</span>;
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> ledger (user_id, amount, <span class="hljs-keyword">type</span>) <span class="hljs-keyword">VALUES</span> (<span class="hljs-number">42</span>, <span class="hljs-number">-500</span>, <span class="hljs-string">'DEBIT'</span>);

<span class="hljs-comment">-- Transaction 2</span>
<span class="hljs-keyword">START</span> <span class="hljs-keyword">TRANSACTION</span>;
<span class="hljs-keyword">INSERT</span> <span class="hljs-keyword">INTO</span> ledger (user_id, amount, <span class="hljs-keyword">type</span>) <span class="hljs-keyword">VALUES</span> (<span class="hljs-number">42</span>, <span class="hljs-number">500</span>, <span class="hljs-string">'CREDIT'</span>);
<span class="hljs-keyword">UPDATE</span> wallets <span class="hljs-keyword">SET</span> balance = balance + <span class="hljs-number">500</span> <span class="hljs-keyword">WHERE</span> user_id = <span class="hljs-number">42</span>;
</code></pre>
<p>InnoDB detects the deadlock, rolls back the transaction that is cheaper to undo (by default, the one that has modified the fewest rows), and aborts it with:</p>
<pre><code class="lang-sql">ERROR 1213 (40001): Deadlock found when trying to get <span class="hljs-keyword">lock</span>; try restarting transaction
</code></pre>
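<p>After an ERROR 1213, the statements and locks involved in the most recent deadlock can be inspected in the <code>LATEST DETECTED DEADLOCK</code> section of the InnoDB status output:</p>
<pre><code class="lang-sql">-- Show InnoDB status, including the most recent deadlock
SHOW ENGINE INNODB STATUS;

-- With innodb_print_all_deadlocks = ON, every deadlock is also
-- written to the MySQL error log, not just the latest one
</code></pre>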
<h3 id="heading-postgresql">PostgreSQL</h3>
<p>PostgreSQL uses a <strong>lazy deadlock detection approach</strong>. Instead of checking for deadlocks immediately, it waits for a configurable timeout period before initiating detection. The assumption is that most lock waits are short-lived and will resolve naturally without requiring expensive graph analysis.</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- Configure deadlock detection behavior</span>
<span class="hljs-keyword">SET</span> deadlock_timeout = <span class="hljs-string">'1s'</span>;
<span class="hljs-keyword">SET</span> log_lock_waits = <span class="hljs-keyword">on</span>;
</code></pre>
<p>When a transaction waits longer than deadlock_timeout, PostgreSQL builds a <strong>wait-for graph</strong> and checks for cycles. If a deadlock is detected, one transaction is aborted and the others continue.</p>
<pre><code class="lang-sql"><span class="hljs-comment">-- Example deadlock scenario</span>
<span class="hljs-keyword">BEGIN</span>;
<span class="hljs-keyword">UPDATE</span> seats <span class="hljs-keyword">SET</span> <span class="hljs-keyword">status</span> = <span class="hljs-string">'HELD'</span> <span class="hljs-keyword">WHERE</span> seat_id = <span class="hljs-number">12</span>;
<span class="hljs-comment">-- waits for another transaction</span>
<span class="hljs-keyword">UPDATE</span> payments <span class="hljs-keyword">SET</span> amount = amount + <span class="hljs-number">250</span> <span class="hljs-keyword">WHERE</span> booking_id = <span class="hljs-number">9001</span>;
<span class="hljs-keyword">COMMIT</span>;
</code></pre>
<p>If PostgreSQL detects a deadlock, it terminates one transaction with an error like:</p>
<pre><code class="lang-sql">ERROR: deadlock detected
</code></pre>
<p>This approach minimizes CPU overhead under normal workloads but may result in slightly longer waits before resolution.</p>
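<p>When diagnosing lock waits in PostgreSQL, the built-in statistics views show which sessions are blocked and by whom (<code>wait_event_type</code> and <code>pg_blocking_pids</code> are available in PostgreSQL 9.6 and later; the pid below is a placeholder):</p>
<pre><code class="lang-sql">-- Sessions currently waiting on a lock, with the query they are running
SELECT pid, wait_event_type, wait_event, state, query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';

-- Which sessions are blocking a given waiting backend
SELECT pg_blocking_pids(12345);  -- 12345: pid of the blocked backend
</code></pre>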
<h2 id="heading-final-thoughts">Final Thoughts</h2>
<p>Database deadlocks are a natural result of concurrent access in multi-user systems, not a database flaw. Modern databases detect and resolve them automatically, but well-designed applications reduce their frequency through consistent locking, short transactions, and proper indexing. Ultimately, treating deadlocks as expected events and handling them with retry logic is key to building reliable and scalable systems.</p>
]]></content:encoded></item></channel></rss>