Benford’s Law – Can you detect fraud by the first digit?
Even in today’s world of powerful technological surveillance, fraud remains part of our lives. Corrupt leaders rise to power thanks to fraudulent voting, rogue employees embezzle from their companies, and unscrupulous government officials steal from the state and the taxpayer. How can we catch these criminals without giving up too much of our civil liberties?
In some cases, a quirk of mathematics can come to the rescue. Imagine that I collect a list of the heights of all the mountains in the world. Now, I take these numbers and collect the first digits. How many ones are there? How many twos, threes, fours, etc?
As most of these mountains will be on different continents from each other, we can assume that their heights will not depend on each other in any way. Most people would therefore guess that each digit from one to nine turns up equally often. That is the wrong answer: there should be more ones than twos, more twos than threes, and so on, and this is true whether you measure those mountains in feet, metres or double decker buses. This is referred to as Benford’s Law (and the distribution of first digits it predicts is called the Benford distribution).
Why is this the case? No-one is completely sure, but it seems to have something to do with logarithms. If you have two numbers A and B, the logarithm (or log) of A added to the log of B equals the log of A multiplied by B. If you don’t have a calculator, this makes logarithms an extremely convenient way of multiplying large numbers: look up the two logs, add them, and convert the sum back. Books of logarithmic tables (and handheld slide rules) were the state of the art in rapid calculation before the modern computing age.
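The log-table trick can be checked in a few lines of Python. This is only an illustrative sketch: the two values of A and B are arbitrary numbers picked for the example, not anything from a real table.

```python
import math

# Arbitrary illustrative values for A and B (assumptions, not real data).
a, b = 1234.0, 5678.0

# Adding the logs of A and B ...
log_sum = math.log10(a) + math.log10(b)
# ... gives the same number as taking the log of A multiplied by B.
log_product = math.log10(a * b)

print(log_sum, log_product)   # the two agree up to floating-point rounding
print(10 ** log_sum, a * b)   # converting the sum back recovers the product
```

With a book of tables, the two `math.log10` calls would be look-ups, and the final step would be a reverse look-up.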
Even as far back as 1881, the astronomer Simon Newcomb had noticed that the earlier pages of log tables (dealing with numbers beginning with 1) were much more worn than the other pages, an early sign of Benford’s Law.
The first digit of a number is related to its log. Taking the first digit of a collection of numbers is like throwing darts at a line on a logarithmic scale (see below). Those darts tend to land in the stretches of the line that belong to numbers beginning with 1 and 2 more often than anywhere else, because those stretches are the widest. Logarithmic scales are defined by multiplication, not by addition like standard number lines. This means that equal steps on an ordinary number line are not equal steps on a logarithmic one. In fact, the distance between log 1 and log 2 is larger than the distance between log 2 and log 3 (to get from 1 to 2 only by multiplying, you must multiply by 2; to get from 2 to 3 by multiplying, you only need to multiply by 1.5). Also, the distance between log 10 and log 20 (or log 100 and log 200) is the same as the distance between log 1 and log 2. In the logarithmic world, it is the difference in multiplication that matters, not the difference in addition.
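To put numbers on the dart-throwing picture, here is a short Python sketch. It assumes the darts land evenly along the logarithmic line, so the chance of hitting the stretch for a given first digit is simply the width of that stretch, log(d + 1) minus log(d) in base 10. This is the standard statement of Benford’s Law; the function name `benford_expected` is made up for this example.

```python
import math

def benford_expected(digit: int) -> float:
    """Width of the stretch from log10(digit) to log10(digit + 1).

    Under the even-darts assumption this is the expected share of
    numbers whose first digit is `digit`.
    """
    return math.log10(digit + 1) - math.log10(digit)

for d in range(1, 10):
    print(d, round(benford_expected(d), 3))
```

The nine widths add up to exactly 1, since together they cover the line from log 1 to log 10. Ones come out at about 30% of first digits and nines at under 5%.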
This is what is believed to give rise to Benford’s Law. In many cases, we can predict how many first digits will be ones, twos, threes and so on in a dataset without even having to look at it! The heights of mountains, the list of numbers that make up the Fibonacci sequence, the numbers in Bill Clinton’s tax return, the values of the Universe’s fundamental physical constants; all these datasets and more satisfy Benford’s Law.
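The Fibonacci claim is easy to try for yourself. The sketch below counts the first digits of the first 1,000 Fibonacci numbers and prints them beside the shares Benford’s Law predicts, using the standard formula log10(1 + 1/d) for first digit d. The helper names and the choice of 1,000 terms are assumptions made for this example.

```python
import math
from collections import Counter

def fibonacci(n):
    """Return the first n Fibonacci numbers, starting 1, 1, 2, 3, ..."""
    values = []
    a, b = 1, 1
    for _ in range(n):
        values.append(a)
        a, b = b, a + b
    return values

def first_digit_shares(values):
    """Share of the positive integers in `values` starting with each digit 1-9."""
    counts = Counter(int(str(v)[0]) for v in values)
    total = len(values)
    return {d: counts.get(d, 0) / total for d in range(1, 10)}

observed = first_digit_shares(fibonacci(1000))
for d in range(1, 10):
    expected = math.log10(1 + 1 / d)  # Benford's predicted share
    print(d, round(observed[d], 3), round(expected, 3))
```

The observed and predicted columns should sit close together, with ones the most common first digit and nines the rarest.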
So how does this help us with fraud? Tampering is not always immediately visible to someone looking at the raw data, but looking at the statistics of the first digits can help to identify when someone has been monkeying around. If someone rigs an election, chances are the polling stations show vote tallies that don’t start with enough ones, or start with too many fives. If you’re stealing from the company, the first digits of the amounts in the accounts will look suspicious. It’s even been suggested that the macroeconomic data Greece submitted to the EU before joining the Eurozone was fraudulent, as it didn’t satisfy Benford’s Law.
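One common way to turn “looks suspicious” into a number is a chi-squared statistic, which measures how far the observed first-digit counts stray from the counts Benford’s Law predicts (log10(1 + 1/d) of the total for digit d). The Python sketch below is a minimal version, not a description of what any auditor actually runs. The function names are invented here, and both test datasets are synthetic: powers of 2 stand in for well-behaved data, and the integers 100 to 999 stand in for data where every first digit appears equally often.

```python
import math
from collections import Counter

def leading_digit(x):
    """First significant digit of a non-zero number (sign ignored)."""
    # Scientific notation puts the first significant digit up front.
    return int(f"{abs(x):.15e}"[0])

def benford_chi_squared(values):
    """Chi-squared distance between observed first digits and Benford's Law."""
    digits = [leading_digit(v) for v in values if v != 0]
    n = len(digits)
    counts = Counter(digits)
    statistic = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)  # count Benford predicts
        statistic += (counts.get(d, 0) - expected) ** 2 / expected
    return statistic

# Synthetic examples only (assumptions for illustration, not real accounts).
benford_like = [2 ** k for k in range(1, 301)]
flat_digits = list(range(100, 1000))

print(benford_chi_squared(benford_like))  # small: close to Benford
print(benford_chi_squared(flat_digits))   # large: far from Benford
```

With nine digits there are 8 degrees of freedom, and the conventional 5% threshold for that case is about 15.5. A statistic well above it says the first digits are a poor match for Benford’s Law; it does not say why.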
But it’s not as simple as that. There isn’t an iron-clad mathematical proof that real-world data must follow Benford’s Law, although it does appear extremely often, and there are a lot of interesting explanations for why it happens. Also, there are some datasets that definitely will not satisfy Benford’s Law. If you have a dataset that doesn’t vary much (if for example we repeated our original experiment with the heights of human beings instead of mountains), or if there is an “artificial” reason for the first digit (telephone numbers in a given country often start with the same digit), then Benford’s Law definitely won’t apply.
So, are we any further forward? Datasets that obey Benford’s Law are probably fraud-free, but datasets that don’t obey the Law might not be fraudulent. So when you hear Benford’s Law being used as a tool for fraud detection, remember that it is a tool that has its limits.