
Faster isn't better. Predictable is.

Why variance, not average, is the metric that decides whether service operations hold under load.


Eight years ago, on a continuous galvanising line in South Africa, I stood on a gantry holding the wrong components in my hand and watched a planned shutdown overrun.


Photograph   The components, on shutdown day, ArcelorMittal Vanderbijlpark

I was a mechanical engineer, two years out of university. My project was to improve heat distribution across the steel strip — a feedback loop tying thermal distribution to line speed control, so the galvanised coating would adhere properly. The shutdown was to install the centrepiece: a thermal imaging scanner inside one of the furnaces, monitoring strip temperature as it passed through the zinc bath. Everything had been scheduled around a single planned shutdown window. The supplier arrived a day late. Half the components they brought were the wrong specification.



Photograph   On site, during the shutdown, ArcelorMittal Vanderbijlpark


I didn't know much about procurement at the time. I knew about thermal engineering, hydraulics, tolerances, and how a galvanising line is supposed to behave when it is working. What I learned that week was that I had inherited a problem someone upstream of me had never bothered to engineer for. A supplier's average performance, the number that would have shown up on whatever scorecard procurement was reviewing, was almost irrelevant.


What mattered was the variance: the days they arrived late, the components that came wrong. Those are the days that take down furnaces. Years later, working in industries that look nothing alike, I still ask the same question I started asking on that gantry: where is the constraint, and what is its variance?


Most operations leaders don't ask either half. They ask whether the supplier hit SLA last quarter, whether the price is competitive, whether the relationship feels healthy. Those are management questions. The engineering questions sit underneath them, and they're the ones that decide whether your operation holds under load.


Since then, I have scaled the same idea up. At Takealot, I ran the operations behind 12,000-plus sellers on a marketplace. At that scale, procurement isn't about managing relationships — it can't be. There are too many sellers, too many SKUs, too many distributions to track individually. You engineer a platform instead: rules, instrumentation, exception handling, statistical governance. The same principles I'd learned on the gantry. Two very different shapes of system. Same toolkit.


I'm now scaling Workwize's global procurement service operations platform, orchestrating hardware procurement across 118+ countries. The system is past proof-of-concept and into scale-up, which is exactly the moment to install engineering discipline, before a crisis demands it. Building the right system at the right time is half the work of operations.


Here's what running service operations as a production system actually means in practice.


Figure 1   The distribution is the truth. Reliability engineering applied to procurement


01 Suppliers are nodes, not relationships.


A supplier is a component in a system that needs to behave a certain way under load. The most useful thing I know about any supplier is not their average performance — it is the distribution around their average. A supplier delivering in 10 days ± 2 is more valuable than a supplier delivering in 7 days ± 5, every single time, because the second one forces me to carry buffer for failures I cannot predict.
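A minimal sketch of the buffer arithmetic behind that comparison. It rests on assumptions the text does not state: the "±" figure is read as one standard deviation, lead times are treated as roughly normal, and the planning target is a 95% service level. The function name `promise_lead_time` is hypothetical.

```python
from statistics import NormalDist


def promise_lead_time(mean_days: float, sd_days: float,
                      service_level: float = 0.95) -> float:
    """Lead time to plan around so that `service_level` of orders arrive in time.

    Assumes normally distributed lead times (an illustrative simplification).
    """
    return NormalDist(mu=mean_days, sigma=sd_days).inv_cdf(service_level)


# Assumption: "10 days ± 2" and "7 days ± 5" are mean ± one standard deviation.
steady = promise_lead_time(10, 2)
fast_but_erratic = promise_lead_time(7, 5)

print(f"steady supplier:  plan for {steady:.2f} days")            # 13.29
print(f"erratic supplier: plan for {fast_but_erratic:.2f} days")  # 15.22
```

Under these assumptions the supplier with the slower average is the one you can promise sooner. The two break even near an 84% service level (one standard deviation above the mean); above that, which is where most operations plan, the steadier supplier needs less buffer.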


Reliability engineers have measured this for decades. Mean time between failures. Mean time to recovery. They are not exotic concepts. They are exactly the right concepts for thinking about a supplier network, and almost no operations scorecard uses them.
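As an illustration of how those two measures could sit on a supplier scorecard, here is a small sketch. The incident log, the `Incident` type, and the 180-day observation window are all hypothetical; a "failure" is assumed to mean any period in which the supplier could not deliver to specification.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    start_day: float  # day the supplier failure began
    end_day: float    # day normal service was restored


def mtbf_mttr(incidents: list[Incident], observed_days: float) -> tuple[float, float]:
    """Return (mean time between failures, mean time to recovery) in days."""
    if not incidents:
        return float("inf"), 0.0
    downtime = sum(i.end_day - i.start_day for i in incidents)
    uptime = observed_days - downtime
    return uptime / len(incidents), downtime / len(incidents)


# Hypothetical log: three failures over a 180-day window.
log = [Incident(20, 22), Incident(75, 80), Incident(140, 142)]
mtbf, mttr = mtbf_mttr(log, observed_days=180)
print(f"MTBF: {mtbf:.0f} days, MTTR: {mttr:.0f} days")  # MTBF: 57 days, MTTR: 3 days
```

Neither number appears in an average-lead-time column, yet together they describe how often a supplier fails and how long each failure lasts.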


The framework I developed is three letters, ROP: Reliability, Observability, Predictability. It is a reliability-engineering framework wearing a procurement hat. Reliability is the distribution of outcomes. Observability is whether I can see what is happening before it shows up in those outcomes. Predictability is whether tomorrow's behaviour resembles yesterday's enough to plan around.


Figure 2   The ROP framework. Three properties that make a supplier scalable

The suppliers you keep are the ones who score well on those three, even when their averages look worse than alternatives. The suppliers you retire are the ones with great averages and unbounded variance, because those are the suppliers that take the system down at the worst possible moment.


02 The constraint moves. Your job is to find where it lives today.


Theory of Constraints is roughly forty years old and most operations functions still don't apply it correctly. The principle is simple: every system has one bottleneck at a time, every improvement to a non-bottleneck is wasted effort, and the bottleneck moves once you fix it. The discipline is in not fixing the loudest problem until you have checked whether it is the binding one.
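The principle can be shown with a toy model. This is a sketch under a strong simplification: a purely serial flow whose throughput equals the capacity of its slowest stage. The stage names and weekly capacities are invented for illustration.

```python
def bottleneck(capacities: dict[str, float]) -> tuple[str, float]:
    """Return the binding stage and the throughput it allows (serial flow)."""
    stage = min(capacities, key=lambda name: capacities[name])
    return stage, capacities[stage]


# Hypothetical stages with capacity in units per week.
flow = {"sourcing": 120, "customs": 40, "last_mile": 70}
print(bottleneck(flow))  # ('customs', 40)

# Improving a non-bottleneck stage changes nothing.
louder_fix = {**flow, "sourcing": 200}
print(bottleneck(louder_fix))  # ('customs', 40)

# Fixing the real bottleneck raises throughput, and the constraint moves.
binding_fix = {**flow, "customs": 90}
print(bottleneck(binding_fix))  # ('last_mile', 70)
```

Real supplier networks are not serial pipelines, so the model only carries the logic: measure which stage binds before spending effort, then measure again after the fix.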


In one of our country operations last year, our largest reseller was the visible problem. On-time delivery (OTD) was declining. Tickets were stacking. The natural reflex was to escalate, push harder, demand recovery plans. The engineering reflex was to model where the constraint actually lived. Once we did, the answer was structural: the supplier wasn't the bottleneck, the supplier's distribution model was. No amount of pressure on a single node fixes a topology problem. We replaced the architecture, not the supplier.


This is unglamorous work. There is no quarterly award for "did not chase the loud problem." But the alternative — fixing whatever is screaming this week — is how procurement teams end up running on adrenaline forever while the underlying system never improves.


03 Variance reduction beats average improvement.


Walter Shewhart figured this out at Bell Labs in 1924. Deming carried it into Japan and rebuilt their manufacturing base on it. The idea: most variation in a stable process is common-cause, baked into how the system is designed, and cannot be removed by fixing individual cases. The mistake operations teams make, constantly, is treating common-cause variation as if every instance were special-cause — chasing each individual incident as a unique problem when the data is screaming that the system is producing those incidents on schedule.

Figure 3   Common cause. Special cause. Statistical process control, applied to a supplier scorecard

The corollary is the more useful insight: when a real special cause does appear, you have to recognise it against the noise. You can't, if you have been treating noise as signal the whole time.


I look at every supplier dashboard with this distinction in mind. If a supplier misses an SLA, I want to know whether that miss falls inside their normal distribution or outside it. Inside, it is a process question — does the process need to change, or do we need to plan around its real behaviour? Outside, it is a special-cause investigation — what changed, and is the change one-time or structural? Different questions, different responses, and most operations teams ask neither.
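One way to make the inside-or-outside test concrete is an individuals (XmR) control chart, a standard statistical process control tool. The sketch below is an assumption about how such a check could be wired up, not a description of any particular dashboard. The lead-time history is invented, and 2.66 is the usual XmR constant that converts the average moving range into three-sigma limits.

```python
from statistics import mean


def xmr_limits(history: list[float]) -> tuple[float, float, float]:
    """Return (lower limit, centre line, upper limit) for an individuals chart."""
    centre = mean(history)
    moving_ranges = [abs(b - a) for a, b in zip(history, history[1:])]
    spread = 2.66 * mean(moving_ranges)
    return centre - spread, centre, centre + spread


def classify(value: float, history: list[float]) -> str:
    """Label a new observation against the supplier's own past behaviour."""
    lower, _, upper = xmr_limits(history)
    return "common cause" if lower <= value <= upper else "special cause"


# Hypothetical delivery lead times in days for one supplier.
lead_times = [10, 11, 9, 10, 12, 10, 9, 11, 10, 10]

print(classify(13, lead_times))  # common cause: inside the process's normal range
print(classify(17, lead_times))  # special cause: something changed, investigate
```

A 13-day delivery may still breach an SLA, but in this example it is the process behaving as designed, so the response is to change the process or plan around it. The 17-day delivery is the one that warrants an incident investigation.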



Why this matters for how operations gets built


None of these principles are new. Reliability engineers have used them for decades; manufacturing engineers for a century. What is new is treating them as the default toolkit for service operations — not the rescue toolkit after something has already broken.


Operations is an engineering discipline. I'm building mine that way. So far, the numbers say it works.


