Fail loudly: a plea to stop hiding bugs

Posted: 2025-09-20

Introduction

We often make the mistake of hiding logical errors in our software to make it seem more robust. The thinking is simple and seductive: ignore the unexpected condition, prevent a crash, let the program continue.

But the end result is the opposite: guaranteeing that our system’s invariants always hold becomes more difficult.

Logical errors ≠ normal error conditions

Logical errors are conditions that should never be allowed to happen.

This is different from normal error conditions, which are expected. Examples:

The user specified a path that doesn’t exist.
A URL could not be loaded.
The configuration denied the requested operation.

This text doesn’t say anything about those conditions, which should be handled gracefully. This text, instead, focuses on the bugs we should not handle.

Two examples

A few recent cases during the implementation of an AI agent at Google motivated me to write this.

In both cases the intention was laudable: make the software more reliable; prevent runtime errors; avoid losing information (in invalid conditions).

But the end result is the opposite: this kind of code may mask very unexpected situations, making it harder to ensure the reliability of our software:

Example 1: Silent UI Failure

We had to implement a TypeScript browser UI that fetches “messages” from a local backend and renders them in the browser. An if condition validated an important invariant about the order of these messages. When the invariant didn’t hold, we just logged the unexpected situation to the console and recovered; without the check, the code would have hit a runtime error (“reading properties of undefined”, or similar).

But if the invariant doesn’t hold… something must have gone seriously wrong in the backend or in the protocol, where maybe messages have not been propagated correctly.

Example 2: Observability and unregistered conversations

In the backend we added an “observability” layer to track the success rate of all invocations by logging conversation messages to a central database.

We wrote some code to handle a supposedly impossible case, where a conversation isn’t registered with the observability layer, by inventing (mostly empty) metadata for it. A comment explained the logic:

Rather than log an error, we create new metadata entry for this conversation. This allows us to monitor “orphan” messages.

But if a conversation isn’t registered… something must have gone seriously wrong. Are there places where conversations are created that we didn’t identify? Are our listeners being called correctly?

Ironically, this attempt to make the observability layer more robust compromises its integrity. How can we trust a monitoring layer that silently fixes its own data?

As systems grow, correctness drops exponentially

Hiding potential logical errors offers a tempting short-term productivity boost. It feels easier than fixing logical bugs. The software kinda works instead of crashing.

But unreliability compounds. Consider a simplified model of a system with n independent components. If each has a probability P of being correct, the chance of the entire system being correct is Pⁿ. While real-world components aren’t truly independent, this illustrates how small imperfections can cascade into massive system-wide failures.

A system with 50 sub-components, each with a 99.5% correctness probability, has an overall correctness probability of roughly 78% (0.995⁵⁰).
At 500 sub-components, that probability plummets to a shocking 8% (0.995⁵⁰⁰).
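The two figures above follow directly from the Pⁿ model. A minimal sketch to reproduce them (the function name system_correctness is made up for this illustration):

```python
def system_correctness(p: float, n: int) -> float:
    """Probability that all n independent components are correct,
    when each one is correct with probability p (the simplified model)."""
    return p ** n


for n in (50, 500):
    # Prints roughly 77.8% for n=50 and 8.2% for n=500.
    print(f"n={n}: {system_correctness(0.995, n):.1%}")
```

The model is deliberately crude (it assumes independence), but it makes the trend easy to play with: try other values of p and n.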

Depending on how you count, Google systems consist of thousands of sub-components.

Large systems demand extreme rigor. We must do everything possible to ensure the invariants of each component are never broken. The first step is to never hide evidence of incorrectness.

Default to crashing loudly

Once you declare something an invariant, your code must treat it as one. Do not add complexity to handle cases where it breaks. Let exceptions bubble up (e.g., avoid “catch-all-exceptions” statements). Let the program crash. A crash provides a clean, immediate, and unmissable signal that a fundamental assumption has been violated.

For example, if an element must always be in a Python dictionary, use map[key] (raises KeyError exception if absent), not map.get(key) (silently returns None).
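A small sketch of the difference, assuming a hypothetical registry of conversation metadata (the names registry, owner_lenient and owner_strict are invented for this example):

```python
# Hypothetical registry: every known conversation id maps to its metadata.
registry = {"conv-1": {"owner": "alice"}}


def owner_lenient(conversation_id: str):
    # .get() turns a broken invariant into a quiet None that
    # some distant caller has to notice (or not).
    metadata = registry.get(conversation_id)
    return metadata["owner"] if metadata else None


def owner_strict(conversation_id: str) -> str:
    # Indexing raises KeyError at the exact line where the
    # invariant ("the conversation is registered") is violated.
    return registry[conversation_id]["owner"]
```

With an unregistered id, owner_lenient returns None and the program carries on; owner_strict raises KeyError immediately, with a stack trace pointing at the broken assumption.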

Technically, an exception to this rule is adding logic to your program that validates its invariants. Checking that your invariants hold can be justifiable complexity. Just make sure you don’t try to recover; simply raise an exception or crash.
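As a sketch of a check that validates without recovering, take a message-ordering invariant like the one in Example 1. The exact invariant isn’t spelled out in this text, so strictly increasing sequence numbers is an assumption here, and InvariantViolation and check_ordered are hypothetical names:

```python
class InvariantViolation(Exception):
    """Raised when an assumption the code relies on does not hold."""


def check_ordered(sequence_numbers: list[int]) -> None:
    # Assumed invariant: sequence numbers are strictly increasing.
    for previous, current in zip(sequence_numbers, sequence_numbers[1:]):
        if current <= previous:
            # No logging-and-continuing, no reordering: just fail.
            raise InvariantViolation(
                f"messages out of order: {current} arrived after {previous}"
            )
```

An explicit raise is used rather than an assert statement because Python strips assert statements when run with the -O flag, which would silently disable the check.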

Prefer compile-time guarantees. Enforcing invariants through the type system is much better than through tests or runtime checks.

Isolation in multi-tenant systems

Multi-tenant systems (such as a shared RPC server) bring an important exception: a logical error affecting a single unit of work (e.g., an RPC request or a data entity) should probably not crash the shared instance, only the isolation boundary around that unit. The request should still fail with a loud 500 HTTP error, ideally logging the full stack trace and alerting the service owners.

I say “probably” because full global outages may be better than ignoring logical errors and continuing to serve, allowing silent data corruption or privacy violations.

Imagine a Google Chat outage where a combination of subtle factors (including a software version mismatch due to a datacenter coming back online after days of repairs) caused messages to be routed to the wrong recipients. A system-wide crash would have been a far preferable outcome.
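For the common case where failing a single request is the right call, the isolation boundary can be sketched roughly as follows. This is a simplified illustration, not a real server: handle_request, the handler signature and the dict-based request are all assumptions, and alerting is left out.

```python
import logging
from typing import Callable

logger = logging.getLogger("server")


def handle_request(
    handler: Callable[[dict], dict], request: dict
) -> tuple[int, dict]:
    """Run one request inside its own isolation boundary."""
    try:
        return 200, handler(request)
    except Exception:
        # The one place a catch-all is acceptable: the boundary itself.
        # logger.exception records the full stack trace, and the
        # request fails loudly with a 500 instead of a made-up result.
        logger.exception("unhandled error in request %r", request.get("id"))
        return 500, {"error": "internal error"}
```

The handlers themselves stay free of defensive recovery code; a broken invariant inside one simply propagates up to this boundary.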
