Two truths surface the moment a DDoS hits: time compresses, and options shrink. A runbook that lives only in a wiki will not help your team make the next correct move under pressure. What works is an executable runbook—short, explicit, and designed for muscle memory—so responders can act before dashboards even finish loading.
We’ve built and battle‑tested DDoS playbooks for organisations that span e‑commerce, media, and critical SaaS. The patterns are consistent: clarity beats cleverness, defaults save minutes, and communications shape how the incident is remembered. Below is the field guide we use to turn chaos into coordinated action and to leave every incident with a sharper edge.
What “Executable” Really Means in a DDoS Runbook
An executable runbook is built for doing, not reading. It prioritises single‑screen steps over prose, decisions over definitions, and named roles over abstract owners. It embeds the guardrails that keep well‑intentioned fixes from creating collateral damage. That’s why we borrow lessons from why incident playbooks fail under pressure to design steps that survive chaos, not just audits.
At its core, an executable runbook:
- Encodes minimum viable decisions for the first five minutes and the next thirty.
- Calls out sources of truth for telemetry, traffic samples, and change control.
- Makes rollback obvious and fast.
- Bakes in communication cadences to align executives, customer‑facing teams, and responders.
Design Principles that Should Not be Compromised
Start with the runbook format itself. Every action block fits on a phone screen, names a role (not a team), lists a success signal, and links to the exact console view or command. Each branch ends with a “confirm rollback” step and a timestamped note so post‑incident learning has a clean trail.
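To make the format concrete, here is a minimal sketch of one action block as a data structure. The class name, field names, and rendering layout are hypothetical; the point is only that each block carries a role, a success signal, a link, a rollback, and a timestamped note trail.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ActionBlock:
    """One phone-screen-sized runbook step (all field names are hypothetical)."""
    title: str
    role: str            # a named role such as "Edge Operator", not a team
    action: str          # the single thing to do
    success_signal: str  # what the responder should see if it worked
    link: str            # exact console view or command reference
    rollback: str        # how to undo the action
    notes: list = field(default_factory=list)

    def annotate(self, text, now=None):
        """Append a timestamped note so the post-incident trail stays clean."""
        now = now or datetime.now(timezone.utc)
        entry = f"{now.isoformat(timespec='seconds')} {self.role}: {text}"
        self.notes.append(entry)
        return entry

    def render(self):
        """Render the block as five short lines suited to a small screen."""
        return "\n".join([
            f"[{self.role}] {self.title}",
            f"DO: {self.action}",
            f"OK WHEN: {self.success_signal}",
            f"WHERE: {self.link}",
            f"ROLLBACK: {self.rollback}",
        ])
```

If a block cannot be filled in completely, that is usually a sign the step is not yet ready for the runbook.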
When stress spikes, responders should never build steps from memory—they should follow steps they’ve rehearsed. If you need concrete patterns to copy, these practical incident response playbook examples show how to structure roles, steps, and rollbacks.
The First Five Minutes: Triage Under Fire
Before you touch knobs, you must classify the attack. Is it volumetric L3/L4 or transactional L7? That call controls everything downstream: which metrics matter, which vendors you activate, and which controls you trust. Teams often find it helpful to review real-world methods for stopping a DDoS attack so the playbook decisions feel grounded in actual scenarios. Make that determination fast and write it down.
The “first five” is a cadence, not a sprint. Our five‑minute loop sits inside the seven phases of incident response, keeping triage scoped while we protect critical journeys. We start the clock when a credible alert fires or a business KPI dives. The goal is to stabilise decision‑making, not solve the entire incident.
The Five‑Minute Cadence
First a quick look at:
- SYN rate and connection resets (L3/L4 smoke).
- p95/p99 latency and origin error codes (L7 pressure).
- Bot/automation ratios and unique IP churn.
- Ingress edge utilisation vs. origin saturation.
Then a call on:
- suspected layer,
- blast radius (which journeys break), and
- initial mitigation path.
The first five minutes end with a timestamped decision: “Treating as L7. Focus: Checkout + API v2. Path: Tight rate limiting + WAF ruleset shift.”
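The triage call can be sketched as a small function that turns a handful of signals into a first-pass layer guess and a timestamped decision line. The signal names and every threshold below are illustrative assumptions, not recommended values; tune them against your own baselines.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class TriageSignals:
    syn_rate_ratio: float     # current SYN rate divided by baseline
    edge_utilisation: float   # ingress link utilisation, 0..1
    p95_latency_ratio: float  # current p95 latency divided by baseline
    origin_5xx_rate: float    # fraction of origin responses that are 5xx


def classify_layer(s):
    """Return a first-pass layer call. Thresholds are placeholder assumptions."""
    volumetric = s.syn_rate_ratio >= 5.0 or s.edge_utilisation >= 0.8
    transactional = s.p95_latency_ratio >= 3.0 or s.origin_5xx_rate >= 0.05
    if volumetric and transactional:
        return "mixed"
    if volumetric:
        return "L3/L4"
    if transactional:
        return "L7"
    return "unclear"


def decision_note(layer, focus, path, now=None):
    """Format the timestamped decision that closes the first five minutes."""
    now = now or datetime.now(timezone.utc)
    return (
        f"{now:%H:%M}Z Treating as {layer}. "
        f"Focus: {' + '.join(focus)}. Path: {path}."
    )
```

The classification is only a starting hypothesis; the written note matters more, because it gives the next five-minute loop something explicit to confirm or overturn.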
Signal Pack: What to Watch and Where
After triage, responders need a compact signal pack—the shortlist of charts and samples that govern action. This isn’t a NOC wall; it’s a pragmatic set that anchors the next 30 minutes.
Start with edge telemetry: request rates by path, HTTP response code stacks, cache hit ratio, and TLS handshakes per second. Add origin‑side views: thread pools, queue depths, upstream error budgets, and database connection saturation. Finally, track business SLOs—successful checkouts, login success, and content plays—so mitigations don’t “win the war, lose the user.”
Sampling that Shortens Arguments
Every branch in the runbook points to the exact query or packet capture to sample. For L3/L4, that’s PCAP slices at the edge and flow logs from upstream providers. For L7, it’s top URLs, user‑agent clusters, JA3/JA4 fingerprints, and IP entropy. We always include an “annotate now” line so responders mark when each mitigation changes the slope of these signals.
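As one example of an L7 sample, IP entropy can be computed directly from a slice of access-log source addresses. The sketch below uses plain Shannon entropy; the function names are ours, and how you read the number (for instance, whether a drop means a few sources dominate) is a judgment that depends on your normal traffic mix.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def normalised_entropy(values):
    """Entropy scaled to 0..1 by the maximum possible for the distinct count."""
    values = list(values)  # allow generators; we iterate more than once
    distinct = len(set(values))
    if distinct <= 1:
        return 0.0
    return shannon_entropy(values) / math.log2(distinct)
```

Comparing the normalised value for the same path before and after a mitigation gives responders a single number to annotate against the timeline.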
Decision Tree: Rate Limiting, WAF Mode‑Shifts, Blackholing, Scrubbing
Start with the safest, most reversible move that meaningfully buys you time. Your decision tree should read left‑to‑right with neutral verbs and explicit rollbacks. For a government‑grade checklist of mitigations and reversals, see CISA’s DDoS response playbook overview.
Begin at the edge: tighten rate limits on the specific hot paths and push WAF mode‑shifts from monitor to block for signatures that match your traffic sample. If volumetric pressure threatens connectivity, coordinate upstream blackholing for non‑business‑critical prefixes; when customer traffic risks being collateral, activate your scrubbing centre with a pre‑shared profile and warm session keys.
Case studies such as DDoS lessons from the DeepSeek incident show when scrubbing beats blackholing for API‑heavy traffic. The aim is not to stop DDoS attacks in the sense of driving traffic to zero; it is to preserve critical journeys while you reduce attacker leverage.
How to Keep the Tree from Becoming a Maze
Each branch names the next observation to check and the one acceptable outcome that justifies moving deeper. Anything else routes you to rollback and a parallel branch. The result is fewer dead‑end mitigations and faster convergence on stable service.
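The branch rule described here can be sketched as a small tree walk: apply a mitigation, check the named observation, go deeper only on the one acceptable outcome, and otherwise roll back and try the parallel branch. The `Branch` structure and `walk` function are hypothetical illustrations; the apply and rollback callables stand in for whatever your tooling actually does.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Branch:
    name: str
    apply: Callable[[], None]
    rollback: Callable[[], None]
    next_observation: str               # what to look at after applying
    acceptable: Callable[[dict], bool]  # the one outcome that justifies going on
    deeper: Optional["Branch"] = None
    parallel: Optional["Branch"] = None


def walk(branch, observe, log):
    """Walk the tree; return the name of the branch we hold at, or None."""
    while branch is not None:
        branch.apply()
        log.append(f"applied {branch.name}; check {branch.next_observation}")
        if branch.acceptable(observe()):
            if branch.deeper is None:
                log.append(f"holding at {branch.name}")
                return branch.name
            branch = branch.deeper
        else:
            branch.rollback()
            log.append(f"rolled back {branch.name}")
            branch = branch.parallel
    return None
```

A `None` result means every available branch was tried and rolled back, which is itself a clear signal to escalate rather than improvise.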
Hold‑the‑Line Actions that Don’t Break Core Journeys
When pressure climbs, default to moves that turn sharp failure into soft degradation. Prioritise customer journeys over vanity metrics.
Open with cache sheltering: maximise TTLs on static assets and hot content, and favour stale‑while‑revalidate on CDNs to keep pages snappy when origin CPU flares. Then path partitioning: peel off marketing paths and heavy media to lighter infrastructure so transactional flows keep breathing room. Consider temporary feature flags for expensive calls (personalisation, real‑time recommendations), and lower concurrency on non‑critical batch jobs to free origin capacity.
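For cache sheltering, the response header is where TTLs and stale‑while‑revalidate actually get expressed. The helper below is a minimal sketch that builds a `Cache-Control` value; the function name and the example durations are assumptions, `stale-if-error` is an optional extra not mentioned above, and support for these directives varies by CDN, so check your provider's behaviour before relying on it.

```python
def shelter_cache_control(max_age_s, stale_while_revalidate_s, stale_if_error_s=None):
    """Build a Cache-Control value for sheltering static or hot content.

    All durations are in seconds. `stale_if_error_s` is optional.
    """
    durations = [max_age_s, stale_while_revalidate_s]
    if stale_if_error_s is not None:
        durations.append(stale_if_error_s)
    if any(d < 0 for d in durations):
        raise ValueError("durations must be non-negative")

    directives = [
        "public",
        f"max-age={max_age_s}",
        f"stale-while-revalidate={stale_while_revalidate_s}",
    ]
    if stale_if_error_s is not None:
        directives.append(f"stale-if-error={stale_if_error_s}")
    return ", ".join(directives)
```

Only apply a value like this to content that is safe to serve from a shared cache; personalised or authenticated responses need a different policy.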
The Outcomes to Track Here
Watch four curves: origin 5xx, p95 latency on checkout or login, session starts for logged‑in users, and support‑queue inflow. If the first three improve (errors and latency fall, session starts recover) while support inflow doesn’t spike, hold‑the‑line is working. If not, roll back and pivot to scrubbing or upstream controls.
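That check can be written down so nobody has to argue about it mid-incident. The sketch below compares a snapshot taken before the hold‑the‑line actions with one taken after; the metric keys and the `spike_factor` threshold are hypothetical, and "improve" is read as lower errors, lower latency, and more session starts.

```python
def hold_the_line_working(before, after, spike_factor=1.5):
    """Return True if the first three curves improved and support did not spike.

    `before` and `after` are dicts with the keys used below. `spike_factor`
    is an assumed threshold for what counts as a support-queue spike.
    """
    improved = (
        after["origin_5xx"] < before["origin_5xx"]
        and after["p95_latency_ms"] < before["p95_latency_ms"]
        and after["session_starts"] > before["session_starts"]
    )
    no_spike = after["support_inflow"] <= before["support_inflow"] * spike_factor
    return improved and no_spike
```

A `False` result maps to the rollback-and-pivot path rather than to another round of tuning.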
Comms Templates: Execs, Customers, Frontline
Silence is a vacuum the incident will fill for you. Good comms keep stakeholders aligned, cut duplicate pings, and lower the risk of a premature “all clear.” Your runbook should embed three lightweight templates that anyone can send with names and times swapped in. For executive context during widespread outages, link a neutral news recap of the 2016 Dyn outage.
For executives, keep it material: impact, trend, next checkpoint, and your mitigation path. For customers, lead with the journey (“checkout latency for some users”), what you’re doing, and the next status window. For frontline teams—support and social—give exact phrases to reduce anxiety and inconsistency. Every template includes a hold‑time, an owner, and a next update timestamp to protect responder focus.
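An executive template of this kind can be kept as a fill-in-the-blanks string so it cannot be sent half complete. This is a minimal sketch using Python's standard `string.Template`; the field names and wording are our own placeholders, not a prescribed format.

```python
from string import Template

# Field names are hypothetical; adapt them to your own template.
EXEC_UPDATE = Template(
    "[$sent_at] DDoS update for executives\n"
    "Impact: $impact\n"
    "Trend: $trend\n"
    "Mitigation path: $mitigation_path\n"
    "Hold time: $hold_time\n"
    "Owner: $owner\n"
    "Next update: $next_update"
)


def render_exec_update(**fields):
    """Fill the template; substitute() raises KeyError if a field is missing."""
    return EXEC_UPDATE.substitute(**fields)
```

Because `substitute` fails on a missing field, an update without an owner or a next-update time is rejected before it reaches anyone's inbox.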
SLO Guardrails for Messaging
Set limits on what you’ll claim. No promising ETAs you don’t own. Anchor updates in SLOs: “checkout success back above 98%,” not “things look better.” That discipline trains the whole organisation to think in user outcomes.
Warm‑Start Checklist: Probes, Tokens, and Change Control
The worst time to discover a missing synthetic probe or stale API token is mid‑incident. I pre‑stage a warm‑start checklist in the runbook and review it monthly.
Warm‑start includes owning synthetic probes for your top three journeys (public and authenticated), a known‑good CDN config you can revert to in one click, pre‑approved WAF rules ready to flip, and cached credentials for upstream providers. Change windows and rollbacks are pre‑written so you avoid committee pauses when seconds matter.
Owning the First Safe Reversals
Every mitigation you might take—new rules, scrubbing, or blackholing—has its reversal scripted and tested. The checklist also calls for a 15‑minute stability hold after service returns to baseline, with alerts throttled so responders don’t chase ghosts.
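The stability hold is easy to state and easy to cut short under pressure, so it helps to make it a check rather than a feeling. The sketch below assumes you have timestamped samples of one health metric and a predicate for "at baseline"; both the function name and the data shape are illustrative.

```python
from datetime import datetime, timedelta


def stability_hold_complete(samples, is_baseline, now, hold=timedelta(minutes=15)):
    """Return True once the metric has stayed at baseline for the whole hold.

    `samples` is a list of (timestamp, value) pairs. The hold only counts as
    complete if the data reaches back to the start of the window, there is at
    least one sample inside the window, and every sample in it is at baseline.
    """
    window_start = now - hold
    covered = any(t <= window_start for t, _ in samples)
    recent = [v for t, v in samples if window_start <= t <= now]
    return covered and bool(recent) and all(is_baseline(v) for v in recent)
```

Requiring coverage back to the window start prevents a gap in telemetry from being read as fifteen quiet minutes.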
Mitigation Patterns: When Each One Actually Helps
Not all mitigations earn their keep. A useful runbook explains the contexts in which a control pays back the noise it creates.
WAF mode‑shift shines for L7 floods against narrow endpoints, especially when bot fingerprints or path clustering are obvious. Edge rate limiting is fantastic for abuse that shares a user‑agent family or IP churn pattern, but I pair it with path‑specific limits so real users retain headroom.
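To illustrate path‑specific limits, here is a minimal token-bucket sketch in which a hot path gets its own tighter limit while every other path keeps the default. In practice this logic lives in your edge or CDN configuration rather than in application code; the class names, the `(rate, burst)` tuples, and the per-client keying are assumptions made for the example.

```python
class TokenBucket:
    """Classic token bucket: refills at `rate_per_s`, holds at most `burst`."""

    def __init__(self, rate_per_s, burst, now=0.0):
        self.rate = rate_per_s
        self.burst = burst
        self.tokens = float(burst)
        self.last = now

    def allow(self, now):
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PathLimiter:
    """Per-client limiter with a default limit and per-path overrides."""

    def __init__(self, default, overrides=None):
        self.default = default            # (rate_per_s, burst)
        self.overrides = overrides or {}  # path -> (rate_per_s, burst)
        self.buckets = {}

    def allow(self, client, path, now):
        key = (client, path)
        bucket = self.buckets.get(key)
        if bucket is None:
            rate, burst = self.overrides.get(path, self.default)
            bucket = self.buckets[key] = TokenBucket(rate, burst, now)
        return bucket.allow(now)
```

For example, `PathLimiter(default=(10, 20), overrides={"/checkout": (1, 2)})` throttles repeated hits on `/checkout` from one client without shrinking that client's budget on other paths.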
Upstream blackholing is a last resort to protect the broader network when traffic to non‑critical prefixes explodes; it’s ugly but can save the business. Scrubbing centres move big iron into your corner, but they work best when you’ve pre‑registered routes, warmed handshakes, and can tolerate a short TTL on route changes. For a reminder of DNS‑level blast radius, The Verge explains how the Dyn DNS DDoS unfolded.
Beware the Confidence Traps
Three traps show up repeatedly: shipping a new WAF rule that quietly blocks legitimate mobile app versions; letting CDN shield traffic hide origin collapse; and assuming scrubbing activation means you can stop watching SLOs. Build tests for each trap right into the pattern description so nobody trips the same wire twice.
Post‑Incident Learning Loops that Stick
Incidents that end with “we should document this” never change behaviour. We schedule a hot wash within 48 hours while details are fresh and a follow‑up deep‑dive within two weeks to confirm fixes landed. Both are on the calendar from the moment we call “contained.”
In the hot wash, we harvest the exact timeline, the decision points we argued about, and the measurements that actually mattered. In the deep‑dive, we turn that into deltas: updates to the runbook, playbook tests for the next tabletop, and backlog items with owners and dates. We validate updates with a short drill that mirrors how cyber tabletop exercises work, so decisions get tested without risking production.
We also run quarterly tabletop exercises to pressure‑test runbooks, with at least one drill focused on DDoS recovery and comms. After each incident we reset the warm‑start checklist, rotate keys that touched incident tooling, and store artefacts for threat intel. That’s how you turn a bad day into permanent competence.
I’ve learned to name roles precisely (Edge Operator, Origin Operator, Traffic Analyst, Comms Lead, Exec Liaison) and to keep the Clock Owner off the keyboard. The Clock Owner’s job is to timestamp decisions, call the five‑minute cadence, and enforce the next checkpoint. It’s the smallest process change with the biggest payoff.
The final design principle: make the runbook boring to execute. If responders can run it calmly in a tabletop, they can run it under fire. That’s the difference between a DDoS that becomes folklore and one that becomes a footnote.
A DDoS runbook isn’t a binder of best practices; it’s a collection of moves that protect customers while you regain control. If it can’t be executed in five‑minute loops, it isn’t finished yet. Design for decisions and rollbacks, not for cleverness.
When the next surge hits, your team doesn’t need a lecture. They need a first step, a safe second step, and a clear signal that tells them when to stop. Build for that—and pressure becomes practice rather than panic.