On averages, networks, and evidence – and how statistics quietly shifts the ground

Every statistical tool makes assumptions. Not hidden in the fine print, but built into the structure of what the tool can even say. An average assumes the group it describes has meaningful boundaries. A sample assumes the people who ended up in it were drawn fairly. A significance test assumes you know what question you are asking. Most of the time, these assumptions sit quietly underneath the number and nobody looks at them — the number arrives, and it travels on its own. These three results are about what happens when the assumption does the work and the number gets the credit.
The average that improved without anything improving
The assumption: that the boundaries around a group are fixed and neutral.
A hospital publishes its annual outcomes report showing survival rates up for both early-stage and late-stage cancer patients, with no new treatment introduced. The outcomes did not change; the numbers did.
In the 1980s, advances in medical imaging meant that patients who had previously been classified as early-stage were now, correctly, being identified as late-stage. The staging criteria had not changed and neither had the biology — doctors were simply seeing more clearly what was already there.
The mechanism behind what happened to the statistics is pure arithmetic. The patients being reclassified were the most severe early-stage cases, the ones with the worst prognosis within that group. Removing them raised the early-stage average, because the patients pulling it down were now gone. But those same patients were the least severe of the late-stage cases, so adding them raised that average too, because everyone being added was better off than those already there.
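The arithmetic is easy to check directly. In this sketch the survival figures are invented purely for illustration, but the mechanism is the same: move the worst members of one group into a group where they are the best members, and both averages rise.

```python
# Hypothetical 5-year survivors per 100 patients in each group,
# with numbers invented purely for illustration.
early = [90, 85, 80, 50, 45]  # the last two are the most severe early-stage cases
late = [40, 30, 20]

def mean(xs):
    return sum(xs) / len(xs)

print(mean(early), mean(late))  # 70.0 30.0

# Better imaging reclassifies the two worst early-stage patients as late-stage:
# they leave "early" as its worst members and join "late" as its best.
early, late = early[:3], late + early[3:]

print(mean(early), mean(late))  # 85.0 37.0 -- both averages rose,
                                # though no individual outcome changed
```

Note that the pooled average over all eight patients is unchanged; only the two group averages move, because the boundary between the groups moved.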
This is the Will Rogers Phenomenon, named after a joke that when Okies moved from Oklahoma to California, they raised the average intelligence of both states.
What it illustrates is that an average is not a neutral description of a group but an artifact of where you drew the boundaries around that group, so that changing the classification moves the averages without touching a single underlying outcome.
This matters well beyond oncology. School league tables, neighbourhood income statistics, performance benchmarks across teams — anything that compares group averages is vulnerable to the same effect. A reclassification that looks purely administrative can produce a statistical improvement that looks substantive. The average answered the question it was asked, but the question was quietly the wrong one.
The sample that was biased without anyone trying
The assumption: that the people you observe are a fair cross-section of everyone.
Think about the people you know, not all of them but the ones who come to mind easily — the ones who seem to be everywhere, know everyone, always have something on. If you were to compare yourself to the average of that group, you might feel a little less connected than most, a little more peripheral than you probably are. What you are actually doing is drawing a sample that was never going to be representative, and the distortion is a mathematical property of networks rather than a reflection of your social standing.
This is the Friendship Paradox. When you list your friends, who ends up on that list? People who were connected enough to reach you. Highly connected people appear on many such lists, across many people’s social circles; less connected people appear on very few. So when you pool across everyone’s friend lists and compute an average, the highly connected are counted again and again while the less connected barely register. The result holds in any network where some people are more connected than others, which is essentially every real network: your friends have more friends than you do on average, your colleagues have collaborated with more people than you have, and the accounts you follow online tend to have more followers than yours.
In each case, your local view of the network is not a representative slice of the whole but a sample weighted toward whoever was visible enough to reach you in the first place. The discomfort this produces can feel personal, but the cause is structural, and once you see the mechanism it mostly dissolves. You were not comparing yourself to the average person but to a sample that could only ever have been skewed in one direction.
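The over-counting can be seen in a toy network. The four people and their friendships below are invented for illustration; the point is that pooling everyone's friend lists counts a well-connected person once per list they appear on, which weights the pooled average upward.

```python
# A small hypothetical friendship network (names invented for illustration).
friends = {
    "ana":  ["ben", "cara", "dev"],
    "ben":  ["ana"],
    "cara": ["ana", "dev"],
    "dev":  ["ana", "cara"],
}

# Average number of friends per person: each person counted exactly once.
avg_friends = sum(len(f) for f in friends.values()) / len(friends)

# Pool everyone's friend lists and look up each listed person's own friend
# count: "ana" appears on three lists, "ben" on only one, so the
# well-connected are counted again and again.
pooled = [len(friends[name]) for flist in friends.values() for name in flist]
avg_friends_of_friends = sum(pooled) / len(pooled)

print(avg_friends)             # 2.0
print(avg_friends_of_friends)  # 2.25 -- your friends have more friends than you
```

The inequality holds in any network where friend counts vary at all, which is the formal content of the paradox.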
The result that was significant but not convincing
The assumption: that a statement about data is also a statement about what you should believe.
A friend sends you a link to a large study finding that a common supplement reduces the risk of a particular condition. The trial was well-designed, the sample ran to thousands of participants, and the p-value came in below the conventional threshold of 0.05, so your friend is already ordering a bottle. Something still nags at you, though, because the supplement belongs to a category that has produced many failures before and the expected effect, if there was one, was always going to be modest — it feels like it should take more than this to be convincing. That feeling has a name, and it is pointing at something real.
A p-value answers a specific question: how surprising is this data if there is no effect at all? If the answer is very surprising, below the conventional threshold, we call the result significant, but that is all it means — it is a statement about the data under a particular assumption, not a verdict on whether the effect actually exists.
A Bayesian analysis begins from a different place. Rather than asking only how surprising the data is, it also asks what we already knew about the world before the study ran — how plausible was this kind of effect to begin with, given the mechanism proposed and the track record of similar interventions? If the answer is “not very”, because the biology is speculative or because the same class of supplements has failed repeatedly in previous trials, then the data has to do a great deal of work to shift your overall assessment. Statisticians call this starting point the prior probability, and when it is low, even a p-value that clears the conventional threshold may move it only modestly.
This is Lindley’s Paradox. With a large enough sample, a p-value can fall comfortably below 0.05 while a Bayesian analysis of the exact same data concludes that the evidence is not yet strong enough to believe the effect is real. The two frameworks are not disagreeing about the data; they are answering different questions, and as sample sizes grow, those questions can pull in opposite directions. Statistical significance is the answer to one particular question, and sometimes the question you actually need answered is a different one — a large, well-run trial can produce a significant result and still leave a careful analyst genuinely uncertain, because the test did its job while the interpretation needed more than the test alone could provide.
What ties them together
Each of these three results involves a different hidden assumption, and in each case the statistic performed exactly as designed. In the Will Rogers Phenomenon, averages depend entirely on where you draw the lines around groups, and the assumption that those lines are neutral does the damage rather than the arithmetic. In the Friendship Paradox, your local sample is systematically weighted toward whoever was visible enough to reach you, and the assumption that what you see is representative is never stated but is always there. In Lindley’s Paradox, evidence and belief are different currencies, and a significant result is a statement about data under an assumption rather than a direct measure of how much you should update what you believe.
In each case the number is right, but the scaffolding beneath it had quietly failed: the assumption about what the tool was measuring, and whether that matched what you actually wanted to know.
Statistics is often taught as a correction to overconfident intuition, but the deeper lesson here is not that you reached too far. The ground you were standing on had already shifted without telling you, carrying the number along with it and leaving the assumption behind.
The Number Was Right. The Ground Beneath It Wasn't. was originally published in DataDrivenInvestor on Medium.