Fail loudly: a plea to stop hiding bugs
- Posted: 2025-09-20
We often make the mistake of hiding logical errors in our software to make it seem more robust. The thinking is simple and seductive: ignore the unexpected condition, prevent a crash, let the program continue. This hampers software maintainability and correctness.
Logical errors ≠ normal error conditions
Logical errors are conditions that should never be allowed to happen.
This is different from normal error conditions, which are expected. Examples:
- The user specified a path that doesn’t exist.
- A URL could not be loaded.
- The configuration denied the requested operation.
This text doesn’t say anything about those conditions, which should be handled gracefully. This text, instead, focuses on the bugs we should not handle.
Two examples
A few recent cases during the implementation of an AI agent at Google motivated me to write this.
In both cases the intention was laudable: make the software more reliable, prevent runtime errors, and avoid losing information (in invalid conditions).
But the end result is the opposite: hiding the failure may mask very unexpected situations, making it harder to ensure the reliability of our software.
Example 1: Silent UI Failure
We had to implement a TypeScript browser UI that fetches “messages”
from a local backend and renders them in the browser. We had an
if condition validating an important expected invariant
(regarding the order of these messages). When the invariant didn’t hold, we
just logged the unexpected situation to the console and recovered; otherwise,
the code would have hit a runtime error (“reading properties of
undefined”, or similar).
But if the invariant doesn’t hold… something must have gone seriously wrong in the backend or in the protocol, where maybe messages have not been propagated correctly.
Example 2: Observability and unregistered conversations
In the backend we added an “observability” layer to track the success rate of all invocations by logging conversation messages to a central database.
We wrote some code handling a supposedly impossible case, where a conversation isn’t registered with the observability layer, by inventing (mostly empty) metadata for it. A comment explained the logic:
Rather than log an error, we create new metadata entry for this conversation. This allows us to monitor “orphan” messages.
But if a conversation isn’t registered… something must have gone seriously wrong. Are there places where conversations are created that we didn’t identify? Are our listeners being called correctly?
Ironically, this attempt to make the observability layer more robust compromises its integrity. How can we trust a monitoring layer that silently fixes its own data?
As systems grow, correctness drops exponentially
Hiding potential logical errors offers a tempting short-term productivity boost; it feels easier than fixing them. Instead of crashing, the software kind-of works.
But unreliability compounds. Consider a simplified model of a system with n independent components. If each has a probability P of being correct, the chance of the entire system being correct is Pⁿ. While failures of subcomponents in real systems are rarely truly independent, this illustrates how small imperfections can cascade into massive system-wide failures.
A system with 50 sub-components, each with a 99.5% correctness probability, has an overall correctness probability of roughly 78% (0.995⁵⁰).
At 500 sub-components, that probability plummets to a shocking 8% (0.995⁵⁰⁰).
Depending on how you count, Google systems consist of thousands of sub-components.
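The two figures above can be reproduced with a few lines of Python. This is only a sketch of the simplified model (independent components, identical correctness probability); the function name system_correctness is made up for illustration.

```python
def system_correctness(p: float, n: int) -> float:
    """Probability that all n independent components are correct (P to the power n)."""
    return p ** n


# The two cases discussed above: roughly 78% and roughly 8%.
for n in (50, 500):
    print(f"n={n}: {system_correctness(0.995, n):.1%}")
```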
Large systems demand extreme rigor. We must do everything possible to ensure the invariants of each component are never broken. The first step is to never hide evidence of incorrectness.
Default to crashing loudly
Once you declare something an invariant, your code must treat it as one. Do not add complexity to handle cases where it breaks. Let exceptions bubble up. Let the program crash. A crash provides a clean, immediate, and unmissable signal that a fundamental assumption has been violated.
Technically, there is one exception to this rule of not adding complexity around invariants: logic that validates them. Checking that your invariants hold can be justifiable complexity. Just make sure you don’t try to recover; simply raise an exception or crash.
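A minimal sketch of such a validation, loosely modeled on the message-ordering invariant from Example 1. The Message type, its sequence field, and the InvariantViolation exception are all hypothetical names, not the code from the actual project. The check raises and does nothing else: it does not reorder, drop, or patch the data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    sequence: int  # hypothetical field: position of the message in the conversation
    text: str


class InvariantViolation(Exception):
    """Raised when an assumption the code relies on does not hold."""


def check_ordered(messages: list[Message]) -> None:
    """Validate that sequence numbers strictly increase; never try to repair."""
    for previous, current in zip(messages, messages[1:]):
        if current.sequence <= previous.sequence:
            raise InvariantViolation(
                f"message {current.sequence} follows message {previous.sequence}"
            )
```

Callers would invoke check_ordered before rendering and let the exception propagate, so a broken backend or protocol surfaces immediately.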
Prefer compile-time guarantees. Enforcing invariants through the type system is much better than through tests or runtime checks.
Anti-patterns
This section lists a few specific anti-patterns that I’ve seen in practice.
Dictionary access
If an element is expected in a dictionary/map, use an access API that raises an exception rather than one that silently returns a default.
- Python: Use map[key] (raises KeyError exception if absent), not map.get(key) (silently returns None).
- C++: Use map.at(key) (raises std::out_of_range if key is absent), not map[key] (silently inserts a new element).
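The Python half of this advice can be seen in a short sketch. The conversations registry and both function names are hypothetical; the point is only the difference between the two access styles.

```python
# Hypothetical registry mapping conversation ids to metadata.
conversations = {"c1": "metadata for c1"}


def metadata_lenient(conversation_id: str):
    # .get() hides the missing entry: the caller receives None and carries on.
    return conversations.get(conversation_id)


def metadata_strict(conversation_id: str) -> str:
    # Indexing raises KeyError, so a missing entry cannot go unnoticed.
    return conversations[conversation_id]
```

With the lenient version, the None tends to surface much later and far from the real cause; with the strict version, the stack trace points at the lookup itself.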
Catch-all exceptions
Avoid “catch-all-exceptions” statements. Catch only specific exceptions that correspond to normal error conditions.
In Python, this is particularly pernicious: a catch-all handler also catches AssertionError, so it silences failed assert statements.
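A small sketch of how this happens. The functions average and average_hiding_bugs are invented for illustration; the assert stands in for any invariant check.

```python
def average(values: list[float]) -> float:
    # Invariant: callers must never pass an empty list.
    assert values, "average() called with no values"
    return sum(values) / len(values)


def average_hiding_bugs(values: list[float]) -> float:
    try:
        return average(values)
    except Exception:
        # Catch-all: AssertionError is a subclass of Exception, so the
        # violated invariant is swallowed and a made-up value is returned.
        return 0.0
```

Calling average_hiding_bugs with an empty list returns 0.0 as if nothing were wrong, while calling average directly raises AssertionError.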
Cover all enum values explicitly
In code that processes an enum, prefer to explicitly list all enum values, rather than using catch-all defaults.
Consider an enum with values Visit,
OnlyList, and Ignore.
In C++, write this:
    switch (value) {
      case MyEnum::Visit:
        HandleVisit(...);
        break;
      case MyEnum::OnlyList:
      case MyEnum::Ignore:
        break;
    }
Don’t write this:
    switch (value) {
      case MyEnum::Visit:
        HandleVisit(...);
        break;
      default:
        break;
    }
Nor this:
    if (value == MyEnum::Visit)
      HandleVisit(...);
Listing every value explicitly has the advantage that if a new value is added to the enum, as
long as you compile with standard warning flags (-Wall or
-Wswitch), the compiler will alert you:
'NewValue' not handled in switch. The programmer adding the
enum value will have to make a decision: does the new value deserve
special handling?
In Python, do:
    match value:
        case MyEnum.Visit:
            handle_visit(...)
        case MyEnum.OnlyList:
            pass
        case MyEnum.Ignore:
            pass
        case _:
            raise ValueError(...)
This is weaker than the C++ version (where the error is detected during compilation), but reasonable test coverage should still flag the need to make a decision.
Isolation in multi-tenant systems
Multi-tenant systems (such as a shared RPC server) bring an important exception: a logical error affecting a single unit of work (e.g., one RPC request or one data entity) should probably not crash the shared process, only the work inside its isolation boundary. The request should still fail with a loud 500 HTTP error, ideally logging the full stack trace and alerting the service owners.
I say “probably” because full global outages may be better than ignoring logical errors and continuing to serve, allowing silent data corruption or privacy violations.
Imagine a Google Chat outage where a combination of subtle factors (including a software version mismatch due to a datacenter coming back online after days of repairs) caused messages to be routed to the wrong recipients. A system-wide crash would have been a far preferable outcome.
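The per-request isolation described at the start of this section might look roughly like the sketch below. The handle_request wrapper, the logger name, and the (status, body) return shape are assumptions for illustration, not any particular server framework. Note that this is a deliberate catch-all, placed only at the isolation boundary, and that it reports the failure rather than recovering from it.

```python
import logging
from typing import Callable

logger = logging.getLogger("server")  # hypothetical logger name


def handle_request(handler: Callable[[str], str], payload: str) -> tuple[int, str]:
    """Run one request inside its isolation boundary."""
    try:
        return 200, handler(payload)
    except Exception:
        # Fail this request loudly: log the full stack trace and answer 500.
        # The shared process keeps serving other requests.
        logger.exception("unhandled error while serving request")
        return 500, "internal error"
```

Alerting the service owners would hang off the same log line; whether the whole process should stop instead is the judgment call discussed above.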