Cloudflare’s Global Outage Wasn’t a Cyber Attack: The Real Cause Was Surprisingly Ordinary
Cloudflare has published a detailed account of what went wrong during the company’s widespread outage on November 18. The incident lasted several hours and left many websites around the world returning 5xx errors, blocking logins, and slowing down traffic routed through Cloudflare’s network.

It might have looked like a major attack from the outside, but the company says the real cause was something far more mundane: a faulty configuration file triggered by a database permissions change.
The outage began at 11:20 UTC and gradually pulled large parts of Cloudflare’s global edge network into a loop of failures and brief recoveries. By the time things stabilized at 17:06 UTC, the company had spent hours rolling back changes, restarting core systems, and untangling a chain reaction that started with a single file growing larger than expected.
A Simple Database Change Sparked a Chain Reaction
Cloudflare says the problem started with a change inside one of its ClickHouse database clusters. As part of an ongoing permissions update, a query began returning duplicate rows. That query helped generate a “feature file” used by Cloudflare’s Bot Management system. Because of the duplicated data, the file doubled in size.
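To make that failure mode concrete, here is a minimal Rust sketch, not Cloudflare’s actual pipeline; the FeatureRow type and the file-building functions are hypothetical. It shows how a query that suddenly returns every row twice doubles a generated feature file, and how deduplicating before writing would keep the output stable:

```rust
// Minimal sketch (not Cloudflare's actual pipeline): how duplicate rows from a
// metadata query can silently double the size of a generated feature file.
// The FeatureRow type and both functions are hypothetical.
use std::collections::BTreeSet;

#[derive(Clone, PartialEq, Eq, PartialOrd, Ord)]
struct FeatureRow {
    name: String,
    kind: String,
}

/// Naive version: every row returned by the query becomes an entry,
/// so duplicated rows double the output.
fn build_feature_file(rows: &[FeatureRow]) -> Vec<FeatureRow> {
    rows.to_vec()
}

/// Defensive version: deduplicate before writing, so a permissions change
/// that makes the query return each row twice cannot inflate the file.
fn build_feature_file_deduped(rows: &[FeatureRow]) -> Vec<FeatureRow> {
    rows.iter().cloned().collect::<BTreeSet<_>>().into_iter().collect()
}

fn main() {
    let row = |n: &str| FeatureRow { name: n.into(), kind: "numeric".into() };
    // After the permissions change, imagine each row coming back twice.
    let rows = vec![row("req_rate"), row("req_rate"), row("ua_entropy"), row("ua_entropy")];

    assert_eq!(build_feature_file(&rows).len(), 4);          // doubled
    assert_eq!(build_feature_file_deduped(&rows).len(), 2);  // stable
    println!("naive: {} entries, deduped: {} entries",
             build_feature_file(&rows).len(),
             build_feature_file_deduped(&rows).len());
}
```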
This file is distributed across Cloudflare’s entire network every few minutes. The machines that handle routing expected the file to stay below a certain size. When it exceeded that limit, the software failed. Once that bad file began circulating, parts of the network panicked and started returning 5xx errors.
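On the consuming side, the behavior described, software that expects the file to stay under a fixed size and fails hard when it doesn’t, can be sketched roughly like this; the 200-entry limit, type names, and error handling are illustrative assumptions, not Cloudflare’s actual code:

```rust
// Minimal sketch (illustrative only): a consumer that preallocates room for a
// fixed number of features and refuses files that exceed it. MAX_FEATURES and
// the error type shown here are assumptions.
const MAX_FEATURES: usize = 200;

#[derive(Debug)]
enum LoadError {
    TooManyFeatures { got: usize, max: usize },
}

fn load_feature_file(entries: &[String]) -> Result<Vec<String>, LoadError> {
    if entries.len() > MAX_FEATURES {
        // Propagating an error is the safe path; crashing on it in the
        // request path would surface to clients as 5xx responses instead.
        return Err(LoadError::TooManyFeatures { got: entries.len(), max: MAX_FEATURES });
    }
    Ok(entries.to_vec())
}

fn main() {
    // Simulate the doubled file: far more entries than the consumer expects.
    let oversized: Vec<String> = (0..400).map(|i| format!("feature_{i}")).collect();
    match load_feature_file(&oversized) {
        Ok(f) => println!("loaded {} features", f.len()),
        Err(e) => eprintln!("rejected feature file: {e:?}"),
    }
}
```

The design point is that a configuration file loaded in the request path has to fail safely: an unhandled error at that step becomes exactly the kind of user-facing 5xx responses the report describes.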
The odd part was that the system didn’t fail consistently. Some database shards were still returning normal results, so each time the file was regenerated every few minutes it came out either valid or broken, depending on where the query ran. Traffic would recover, then collapse again, which made the incident harder to diagnose.
Why It Initially Looked Like a Massive Attack
Cloudflare’s own status page briefly went down at the same time, even though it isn’t hosted on Cloudflare infrastructure. That coincidence, combined with the erratic error patterns, initially made the team think the network was under a large attack. It wasn’t until later that engineers traced the issue back to the configuration file and the database change.
Once the root cause became clear, Cloudflare stopped the propagation of the broken file, pushed a known-good version into the system, and restarted the proxy layer responsible for routing requests.
What Cloudflare Says It Will Do Next
The company says it’s reviewing several parts of its pipeline to prevent similar failures. That includes improving how internal configuration files are validated, adding more kill switches, reviewing error handling, and ensuring no single component produces enough debugging output to overwhelm the system during an incident.
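The article doesn’t spell out what that validation will look like, but a pre-propagation check of the kind described, with hard size bounds, duplicate detection, and a kill switch that can halt distribution, might look roughly like this sketch; every limit and name in it is an assumption, not Cloudflare’s actual tooling:

```rust
// Hypothetical sketch of pre-propagation validation plus a kill switch.
// All constants, struct names, and checks here are assumptions.
const MAX_FILE_BYTES: usize = 1_000_000;
const MAX_ENTRIES: usize = 200;

struct KillSwitch {
    propagation_enabled: bool,
}

fn validate_before_push(raw: &str, kill: &KillSwitch) -> Result<(), String> {
    if !kill.propagation_enabled {
        return Err("propagation disabled by kill switch".into());
    }
    if raw.len() > MAX_FILE_BYTES {
        return Err(format!("file is {} bytes, limit is {}", raw.len(), MAX_FILE_BYTES));
    }
    let entries: Vec<&str> = raw.lines().collect();
    if entries.len() > MAX_ENTRIES {
        return Err(format!("{} entries, limit is {}", entries.len(), MAX_ENTRIES));
    }
    let mut seen = std::collections::HashSet::new();
    if !entries.iter().all(|e| seen.insert(*e)) {
        return Err("duplicate entries detected".into());
    }
    Ok(()) // only now is the file safe to distribute to the edge
}

fn main() {
    let kill = KillSwitch { propagation_enabled: true };
    // A file where every entry appears twice, as in the incident description.
    let doubled = "req_rate\nreq_rate\nua_entropy\nua_entropy\n";
    match validate_before_push(doubled, &kill) {
        Ok(()) => println!("file accepted for propagation"),
        Err(reason) => eprintln!("file rejected: {reason}"),
    }
}
```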
Cloudflare also acknowledged that this was its most severe outage since 2019. The company apologized for the disruption and said incidents like this force it to re-evaluate fault tolerance across the entire network.

