Sent to me via mailbag:

Have you used TLA+ to model resilience as in resilience engineering (systems in general)? — Feodrippe

**Resilience Engineering**, to my understanding, refers to building systems that can function in the presence of disruptions. For example, if a backend service gets overloaded, its dependencies can automatically switch to fallback logic to keep serving customers and give the service time to recover.

I’ve not done much work *personally* with “resilience engineering”. But the idea comes up a lot in formal methods, in the form of resilience *properties*. These are goals of the systems that we want (and use resilience engineering) to achieve. And I think the details of resilience properties are interesting enough for a deep dive. I’ll use syntax, but it should be applicable to other specification languages, too.

Let `x`

be some measurable value (temperature, number of servers online, queue size, weight, etc) and S be a range of desirable values. The goal of the system is to keep x in the range, or in TLA+ `x \in S`

. We’ll call this being “in control”: `Control == x \in S`

.

The strongest possible property is the invariant `STRONG: []Control`

, “in *all* states x is in control”. If for a single state x is outside control, this property fails. This is undesirable, as often the system *starts* with x out of the range and has to get it in first. So `[]Control`

will automatically fail on the starting step.

(A stronger property is one that’s harder to guarantee.)

On the other side of the spectrum, `WEAK: <>Control`

is the weakest possible property: “in at least one state, x is in control”. This allows x to start outside the range and go inside, but it only needs to do so once, then everything can break down. While there are cases for both the strongest and weakest versions of the prop, we usually want something in between.

So first, we could say that we don’t *start* in control, but once we reach it, we never lose control (`TRIPWIRE`

). In TLA+, this is `<>Control /\ [](Control => []Control)`

. Notice that by conjoining a property to `WEAK`

we make the property stronger. `TRIPWIRE`

is better than `STRONG`

but still too sensitive. If there’s some random variance in `x`

, a tiny fluctuation might trigger the tripwire and then immediately fail the property.

We could instead allow a bit of jostle but say eventually we *converge* on being in control (`STABLE`

). Formally, “eventually always” we’re in control, or there’s eventually a point where, from that point on, we are forever in control. We can write STABLE as `<>[]Control`

.

Finally, resilience. It’s possible to achieve stability in a closed system, where the only thing that matters is the other components of the system, but in an open system, the outside world has a vote too. The temperature can be thrown out of the right range by a sudden heat wave, or a storm knocks out one of our servers, and so puts us out of control. Often the best we can do is guarantee we can *recover* from a shock: `RESILIENT`

is the property that we always eventually return to control, or in TLA+ `[]<>Control`

.

Formal methods isn’t just about how properties relate to systems. It’s also useful to talk about how properties relate to other properties. `A => B`

means that A is stronger than B:^{1} it is plausible for a system to satisfy B without satisfying A, but not the other way around.

Property strength is a **partial order**: Most properties aren’t comparable to each other. A and B can be incomparable even if they both imply C. Fortunately, partial orders are transitive: `A => B`

and `B => C`

means that `A => C`

. Here’s the current spectrum

```
STRONG => TRIPWIRE => STABLE => RESILIENT => WEAK
```

The important one here is `STABLE => RESILIENT`

, or `<>[]Control => []<>Control`

. To explain this intuitively, if you eventually converge on reaching control, then nothing in the future kicks you out of control, so you will successfully recover from all (zero) shocks.

These are relationships between different types of properties. We can also create stronger or weaker properties within the same type by tweaking the size of the target range. Let’s redefine all of our properties to be parameterized on a set of values, so f.ex `STABLE(1..3) == <>[]Control(1..3) = <>[](x \in 1..3)`

. Then any system satisfying `STABLE(1..3)`

*also* satisfies `STABLE(1..10)`

, so the former property is stronger. For every property P on our spectrum,

```
IF narrow \subseteq wide
THEN P(narrow) => P(wide)
```

This follows because every value of `narrow`

is also a value of `wide`

, so being restricted to `narrow`

means we’re restricted to `wide`

too. This *doesn’t* hold if `narrow`

isn’t a subset, even if it’s much smaller: `STABLE(1..3)`

doesn’t automatically guarantee `STABLE(2..1000)`

.

Given the resilient system, the *ideal* property to guarantee is `STABLE(narrow)`

. In practice, that’s infeasible, because the reason we’re designing a resilient system in the first place is because we’re expecting disruptions. There’s going to be some level of strength that’s beyond our reach, either impossible or too expensive. There are two weaker properties we can aim for:

`RESILIENT(narrow)`

`STABLE(wide)`

These are *not* necessarily comparable. A system can satisfy (1) without (2) or (2) without (1). So we can make *both* of them goals:

```
Goal == RESILIENT(narrow) /\ STABLE(wide)
```

There’s one more way we can weaken a property: we can *precondition* it on another property.

```
STABLE(wide) => RESILIENT(narrow)
```

“*If* we can guarantee we’re stable in the `wide`

range, *then* we can guarantee we’re resilient in `narrow`

range”. This creates a “metastable state”: As long as we remain in a certain value range, we can always return to control. If we’re pushed too far, though, we leave the metastable state and all bets are off.

We can define multiple metastable states in our system:^{2}

```
Metastable(wide, narrow) == STABLE(wide) => RESILIENT(narrow)
Props ==
\A <<x, y>> \in {<<w1, n1>>, <<w2, n2>>}:
Metastable(x, y)
```

What about saying “we recover *quickly*”? That’s trickier. One thing we can do is say there’s an implicit “time horizon” in our system, so that a failure of `[]<>Resilient`

could mean both “we never recover” and “we recover the next day, which is equivalent to ‘never recovering’ for our purposes.”

If we want stricter time bounds, we need to embed a notion of time *in* our model of the system, and then track how many time units various system actions take. This is a LOT hard to do so you probably won’t do it unless you absolutely need to.

I should probably turn this into a learntla topic or something

Also reminder that I’m running a TLA+ workshop on March 20. Use the code `C0MPUT3RTHINGS`

for 15% off!