Fail loudly: a plea to stop hiding bugs

We often make the mistake of hiding logical errors in our software to make it seem more robust. The thinking is simple and seductive: ignore the unexpected condition, prevent a crash, let the program continue. This hampers software maintainability and correctness.

Logical errors ≠ normal error conditions

Logical errors are conditions that should never be allowed to happen.

This is different from normal error conditions, which are expected: invalid user input, a network timeout, a file that does not exist.

This text is not about those conditions, which should be handled gracefully. It focuses instead on the bugs we should not handle.

Two examples

A few recent cases during the implementation of an AI agent at Google motivated me to write this.

In both cases the intention was laudable: make the software more reliable; prevent runtime errors; avoid losing information (in invalid conditions).

But the end result is the opposite: hiding the error may mask very unexpected situations, making it harder to ensure the reliability of our software.

Example 1: Silent UI Failure

We had to implement a TypeScript browser UI that fetches “messages” from a local backend and renders them in the browser. We had an if condition validating an important invariant (regarding the order of these messages). When the invariant didn’t hold, we just logged the unexpected situation to the console and recovered; without the check, the code would have hit a runtime error (“reading properties of undefined”, or similar).

But if the invariant doesn’t hold… something must have gone seriously wrong in the backend or in the protocol, where maybe messages have not been propagated correctly.

Example 2: Observability and unregistered conversations

In the backend we added an “observability” layer to track the success rate of all invocations by logging conversation messages to a central database.

We wrote some code handling a supposedly impossible case, where a conversation isn’t registered with the observability layer, by inventing (mostly empty) metadata for it. A comment explained the logic:

Rather than log an error, we create new metadata entry for this conversation. This allows us to monitor “orphan” messages.

But if a conversation isn’t registered… something must have gone seriously wrong. Are there places where conversations are created that we didn’t identify? Are our listeners being called correctly?

Ironically, this attempt to make the observability layer more robust compromises its integrity. How can we trust a monitoring layer that silently fixes its own data?
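A minimal sketch of the alternative, in Python. The names here (ObservabilityLayer, register, log_message, UnregisteredConversationError) are hypothetical and are not taken from the real system; the only point is that an unregistered conversation raises instead of receiving invented metadata.

```python
class UnregisteredConversationError(RuntimeError):
    """Raised when a message arrives for a conversation nobody registered."""


class ObservabilityLayer:
    """Hypothetical stand-in for the monitoring layer described above."""

    def __init__(self) -> None:
        # conversation_id -> metadata recorded at registration time
        self._metadata: dict[str, dict[str, str]] = {}
        # Stand-in for the central database.
        self.logged: list[tuple[str, str]] = []

    def register(self, conversation_id: str, metadata: dict[str, str]) -> None:
        self._metadata[conversation_id] = metadata

    def log_message(self, conversation_id: str, text: str) -> None:
        if conversation_id not in self._metadata:
            # Invariant broken: do not invent metadata, surface the bug.
            raise UnregisteredConversationError(
                f"conversation {conversation_id!r} was never registered"
            )
        self.logged.append((conversation_id, text))
```

With this shape, the first "orphan" message points straight at the code path that forgot to register its conversation.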

As systems grow, correctness drops exponentially

Hiding potential logical errors offers a tempting short-term productivity boost; it feels easier than fixing them. Instead of crashing, the software kind-of works.

But unreliability compounds. Consider a simplified model of a system with n independent components. If each has a probability P of being correct, the chance of the entire system being correct is Pⁿ. While failures of subcomponents in real systems are rarely truly independent, this illustrates how small imperfections can cascade into massive system-wide failures.
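To make the Pⁿ model concrete, here is a small Python calculation. The value P = 0.999 and the component counts are assumptions picked for illustration, not measurements of any real system.

```python
def system_correctness(p: float, n: int) -> float:
    """Probability that all n independent components are correct."""
    return p ** n


# Each component is 99.9% correct (an illustrative assumption).
for n in (10, 100, 1000):
    print(f"n={n}: {system_correctness(0.999, n):.3f}")
```

Under these assumptions, a system of 1000 components that are each 99.9% correct is fully correct only about 37% of the time.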

Depending on how you count, Google systems consist of thousands of sub-components.

Large systems demand extreme rigor. We must do everything possible to ensure the invariants of each component are never broken. The first step is to never hide evidence of incorrectness.

Default to crashing loudly

Once you declare something an invariant, your code must treat it as one. Do not add complexity to handle cases where it breaks. Let exceptions bubble up. Let the program crash. A crash provides a clean, immediate, and unmissable signal that a fundamental assumption has been violated.

Technically, there is one exception to the rule of not adding complexity around invariants: logic that validates them. Checking that your invariants hold can be justifiable complexity. Just make sure you don’t try to recover; simply raise an exception or crash.
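As a sketch of validation without recovery, the following Python check raises as soon as an ordering invariant is broken. InvariantViolation and check_strictly_increasing are hypothetical names, and the "strictly increasing sequence numbers" invariant is only an assumed stand-in for the message-order invariant from Example 1.

```python
class InvariantViolation(Exception):
    """A condition that should never happen did happen."""


def check_strictly_increasing(sequence_numbers: list[int]) -> None:
    """Raise if message sequence numbers are not strictly increasing."""
    for previous, current in zip(sequence_numbers, sequence_numbers[1:]):
        if current <= previous:
            # Validate, then fail: no log-and-continue, no repair.
            raise InvariantViolation(
                f"message order broken: {current} follows {previous}"
            )
```

An explicit raise is used rather than an assert statement because Python removes assert statements when run with the -O flag.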

Prefer compile-time guarantees. Enforcing invariants through the type system is much better than through tests or runtime checks.

Anti-patterns

This section lists a few specific anti-patterns that I’ve seen in practice.

Dictionary access

If an element is expected in a dictionary/map, use an access API that raises an exception when the key is missing, rather than one that silently returns a default.
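In Python the difference is between dict.get and plain indexing. The handlers table and both function names below are made up for illustration.

```python
# Hypothetical lookup table; the keys and values are illustrative only.
handlers = {"visit": "handle_visit", "ignore": "handle_ignore"}


def lookup_silent(kind: str):
    # Anti-pattern: a missing key becomes None and fails far from the cause.
    return handlers.get(kind)


def lookup_loud(kind: str) -> str:
    # Preferred when the key is expected: a missing key raises KeyError here.
    return handlers[kind]
```

With a mistyped key, lookup_silent returns None, which tends to surface later as an unrelated error, while lookup_loud fails at the exact line that made the wrong assumption.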

Catch-all exceptions

Avoid “catch-all-exceptions” statements. Catch only specific exceptions that correspond to normal error conditions.

In Python, this is particularly pernicious: a catch-all handler (except Exception) also catches AssertionError, so it silences failed assertions.
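The following Python sketch shows the effect. The function names and the "count must never be negative" invariant are assumptions made up for this example; json.JSONDecodeError stands in for a normal, expected error condition.

```python
import json


def parse_count_swallowing(raw: str) -> int:
    try:
        count = json.loads(raw)["count"]
        assert count >= 0, "count must never be negative"
        return count
    except Exception:
        # Anti-pattern: AssertionError is an Exception too, so the
        # failed invariant is silently turned into a default value.
        return 0


def parse_count_strict(raw: str) -> int:
    try:
        count = json.loads(raw)["count"]
    except json.JSONDecodeError:
        # Normal error condition: malformed input is expected and handled.
        return 0
    # Outside the try block, so a broken invariant propagates.
    assert count >= 0, "count must never be negative"
    return count
```

Given the input '{"count": -1}', the first function quietly returns 0, while the second raises AssertionError and exposes the bug.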

Cover all enum values explicitly

In code that processes an enum, prefer to explicitly list all enum values, rather than using catch-all defaults.

Consider an enum with values Visit, OnlyList, and Ignore.

In C++, write this:

switch (value) {
  case MyEnum::Visit:
    HandleVisit(...);
    break;
  case MyEnum::OnlyList:
  case MyEnum::Ignore:
    break;
}

Don’t write this:

switch (value) {
  case MyEnum::Visit:
    HandleVisit(...);
    break;
  default:
    break;
}

Nor this:

if (value == MyEnum::Visit)
  HandleVisit(...);

The explicit form has the advantage that if a new value is added to the enum, as long as you compile with standard warning flags (-Wall or -Wswitch), the compiler will alert you: 'NewValue' not handled in switch. The programmer adding the enum value will have to make a decision: does the new value deserve special handling?

In Python, do:

match value:
  case MyEnum.Visit:
    handle_visit(...)
  case MyEnum.OnlyList:
    pass
  case MyEnum.Ignore:
    pass
  case _:
    raise ValueError(...)

This is weaker than the C++ version (where the error is detected during compilation), but reasonable test coverage should still flag the need to make a decision.

Isolation in multi-tenant systems

Multi-tenant systems (such as a shared RPC server) bring an important exception: a logical error affecting a single unit of work (e.g., one RPC request or one data entity) should probably crash only that isolation boundary, not the shared instance. The request should still fail loudly with an HTTP 500 error, ideally logging the full stack trace and alerting the service owners.

I say “probably” because full global outages may be better than ignoring logical errors and continuing to serve, allowing silent data corruption or privacy violations.

Imagine a Google Chat outage where a combination of subtle factors (including a software version mismatch due to a datacenter coming back online after days of repairs) caused messages to be routed to the wrong recipients. A system-wide crash would have been a far preferable outcome.
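For the common case where failing a single request is the right boundary, a minimal Python sketch could look like this. serve_request, the logger name, and the dictionary-shaped requests are hypothetical; a real server framework would normally provide this boundary itself.

```python
import logging
from typing import Callable

logger = logging.getLogger("rpc_server")


def serve_request(
    handler: Callable[[dict], dict], request: dict
) -> tuple[int, dict]:
    """Run one request; a bug fails this request only, and loudly."""
    try:
        return 200, handler(request)
    except Exception:
        # Isolation boundary: the one place a broad catch is deliberate.
        # logger.exception records the full stack trace; a real service
        # would also alert its owners here.
        logger.exception("logical error while serving request %r", request)
        return 500, {"error": "internal error"}
```

The broad except is acceptable here only because it sits at the boundary and converts the failure into a loud 500 response; it does not try to repair anything and does not return a made-up success.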