StatisticsData PrivacyClassificationBayesian

When Murder Turns Out to Be Bad Statistics

David Junge

|July 2, 2024|6 min read

When Murder Turns Out to Be Bad Statistics

...or what a criminal court can learn from Data Privacy Classifications (nerd warning)

The human cost of bad statistics

Sally Clark, a UK solicitor, faced a harrowing ordeal in 1999 when she was wrongfully convicted of murdering her two infant sons. The prosecution's argument hinged on the notion that the likelihood of two siblings succumbing to Sudden Infant Death Syndrome (SIDS) was astronomically low, leading them to assert foul play.

But the statistical foundation of their case was deeply flawed. They erroneously treated the deaths as independent events. Later, statisticians revealed that the proper calculation, which accounted for genetic and environmental factors, indicated that the deaths were not as improbable as the prosecution claimed. In fact, during the 2003 retrial, these heroic statisticians demonstrated that the deaths were likely due to natural causes.

Sally Clark's tragic story is not unique. Amanda Knox was convicted in 2009 of murdering her roommate Meredith Kercher. However, after enduring multiple appeals and retrials, the Italian Supreme Court finally overturned her conviction in 2015, recognizing the misinterpretation of DNA statistics.

Similarly, Lucia de Berk, a Dutch nurse, was convicted in 2003 of multiple murders and attempted murders of her patients based on dubious statistical analysis of mortality rates in her wards. The prosecution's calculations suggested an implausibly high rate of deaths under her care, arousing suspicion. Yet, in 2006, her conviction was overturned, and by 2010, she was fully acquitted after a thorough review of the evidence and statistical analyses.

Both Sally, Amanda, and Lucia nearly lost their lives due to the judge's inability to establish the a priori probability of the event: that the poor souls were innocent.

Now, we may note that we are actually committing another statistical crime in selecting the people who survived, fittingly known as survival bias. For who knows how many souls we have lost that did not get the data right in the first place.

The Raven Paradox

The ability to assess the probability of an event in a given population is fundamental to our ability to determine the likelihood of the event. To unpack the somewhat convoluted argument, let's visit the Raven Paradox first described by the German philosopher Carl Gustav Hempel.

The paradox is named after the following example:

Hypothesis: All ravens are black.

Contrapositive: All non-black things are not ravens.

According to logic principles, the contrapositive statement is equivalent to the original hypothesis. Therefore, any evidence that supports the contrapositive should also support the original hypothesis.

Observation: Suppose you observe a green apple. According to the contrapositive, a green apple (a non-black thing) supports the hypothesis that all ravens are black.

Which leads us to conclude that:

Observing a green apple: Supports the hypothesis that all ravens are black.

A black crow observation: Also supports the hypothesis.

Intuitively, it seems odd that observing a green apple (which is not related to ravens at all) could somehow provide support for the hypothesis about ravens' color.

Now you might be thinking: there is a lot of stuff that is not black and not a raven. So, what are you going on about?

Suppose you (like me) are into Data Privacy Classification and have ever tried to do a regex to find social security numbers. In that case, you might have discovered that the regex that worked wonderfully when categorizing data from a municipality produced nothing but false positives when classifying data from a large windmill producer.

Not surprisingly, the actual density of social security numbers is much higher in a municipality than in a production company. The exact number of social security numbers in a given dataset is essential if we are to assess the likelihood that a social security number that our regex identified is actually a social security number.

Let's say you have a dataset with 1000 documents, and in 1 of them, there is a social security number. The regex might be 99% accurate; this will lead to 10 false positives and one positive for a total number of 11 documents flagged to contain social security numbers, or a quality of only 9% (1/11).

To reach a quality level of 95% (as is the Data & More standard), the precision of the classification must be 99.995% accurate. This will find 100 social security numbers out of 1 million documents and only five false positives if the prior probability of finding the social security numbers is 1/1000.

To be precise: the a priori probability of an event matters, if we should assess the likelihood of correctly identifying the event.

Bayesian reasoning

In Bayesian terms, this paradox can be explained through the concept of updating beliefs based on evidence. Here's how it works:

Prior Probability: Before observing any evidence, you have some prior belief about the hypothesis P(H).

Likelihood: You consider the likelihood of observing the evidence if the hypothesis is true P(E|H) and if it is false P(E|¬H).

Posterior Probability: After observing the evidence, you update your belief to form the posterior probability P(H|E).

When you observe a green apple, you are indirectly confirming the contrapositive (all non-black things are not ravens). Although this evidence is weak and doesn't strongly affect your belief in the hypothesis, it still technically provides some confirmation according to Bayesian updating.

But if you are a murder suspect (a black raven), it's pretty bad if all things unrelated start being another statistic that ever so slightly points to your guilt, which was more or less how the initial cases went. But here is the kicker...

Stop words: how to identify a false positive

Let's look at the example on social security numbers again, where the a priori density was 0.1% (1/1000), and the accuracy of the test (regex) was 99%. This gives us a 9% probability that a positive is actually a positive. This leads us to 91% of the data being false positives, which is much easier to identify: actually 91 times easier to identify than the true positive, if you do a second test.

In document classification, we use the fact that it is often easier to identify a false positive than a positive. Specifically, we start with words that indicate social security numbers, like "social security number" and "CPR," and many other start words. But we also have stop words, which is a way to identify false positives like "invoice number," "SKU," and "Unsubscribe."

In the end, it was the human stop words that freed the three women: signs of the improbability of being a murderer, or as Hempel would have put it, of not being a black raven.

./D

Also published on LinkedIn

David Junge

CTO & Co-founder