Does UK ONS COVID data pass the Benford Law sniff test?
In what follows we sketch the rationale behind Benford’s law, which describes how often each digit appears as the leading digit of the numbers in a data set. We do this through the example of the leading digit of a series formed by incrementing an initial principal of £1 by £1 each day. We then look at a few examples from number theory that neatly encode the Benford relationship, and follow up with some UK and US data sets that, after some pivot-table manipulation in Google Sheets, mostly show compliance.
Benford’s Law is well known as a check on the authenticity of a data set: whether the data items have been generated naturally or (not so) craftily curated. We explore Benford’s law by reviewing some well-known examples from finance and number theory, then apply it to the UK government’s ONS COVID data, US job types and Michigan lake sizes, and see how the granularity of the categorisation of the data appears to affect its compliance.
Benford’s prevalence of leading numbers law is summarised in the following table:
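The percentages in this table follow from the closed form P(d) = log10(1 + 1/d) for a leading digit d between 1 and 9. A minimal Python sketch that regenerates them (the function name `benford_probabilities` is chosen here for illustration and does not come from the spreadsheet):

```python
import math

def benford_probabilities():
    """Return {d: P(d)} with P(d) = log10(1 + 1/d) for d = 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Print the distribution as percentages, one leading digit per line.
for digit, p in benford_probabilities().items():
    print(f"{digit}: {p:.1%}")
```

The nine probabilities sum to 1 because the logs telescope to log10(10).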
Looking at the logs (base 10) of the numbers from 1 to 10, we can see that the difference of 0.3 between log 2 and log 1 is the same as the difference between log 4 and log 2, or between log 8 and log 4. So Benford’s law says that if we look at the first digits of a set of numbers, there should be as many 1s as there are 4s, 5s, 6s and 7s put together (or 2s and 3s put together). As Colin Beveridge puts it, “if your data is spread widely enough (and some other conditions apply), the lead digit of log(x) is (more or less) uniformly distributed”.
Those versed in the delights of compounded returns in finance will recognise the Benford leading-digit distribution 1: 30%, 2: 18%, 3: 12%, 4: 10%, 5: 8%, ... as the continuously compounded returns R that arise from forming the log returns, log(1+r), from the simple interest r that delivers an increment of £1 each day on a principal of £1.
The continuously compounded (log) return, R, for a stock rising incrementally by £1 each day follows Benford’s leading-digit probability distribution:
Indeed, if the simple interest rate is a mere r = 1% per day, then the doubling, tripling, ... of the principal occur over longer periods, as follows:
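Written out, the identity behind this claim is short. The symbols P_d (the principal on day d) and r_d (that day’s simple return) are notation introduced here for illustration, and logarithms are taken to base 10:

```latex
P_d = d, \qquad
r_d = \frac{P_{d+1} - P_d}{P_d} = \frac{1}{d}, \qquad
R_d = \log_{10}(1 + r_d) = \log_{10}\!\left(1 + \frac{1}{d}\right), \qquad
\sum_{d=1}^{9} R_d = \log_{10} 10 = 1 .
```

So the log return on day d equals the Benford probability of leading digit d, and the nine returns sum to 1 as the principal grows from £1 to £10.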
We note that the doubling period at 1% is 70 days. To triple your money takes a further 41 days. With 1% interest compounded daily, your principal passes £10 after 232 days. As a percentage of these 232 days, the doubling portion takes Benford’s 30% and the tripling his 17.6%. Running a Benford test on the accumulating principal we see the following (in the image below, all the leading digits shown are 1):
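The day counts quoted here can be checked with a short simulation. This is a sketch under the assumption that day 0 holds the initial £1 and that 232 daily values are counted; the function names are illustrative rather than taken from the spreadsheet:

```python
def leading_digit(x):
    """First significant digit of a positive number."""
    while x >= 10:
        x /= 10
    while x < 1:
        x *= 10
    return int(x)

def compound_leading_digit_counts(rate=0.01, days=232):
    """Count leading digits of a £1 principal compounded daily at `rate`.

    The first value counted is the initial £1, so `days` values are counted.
    """
    counts = [0] * 9
    principal = 1.0
    for _ in range(days):
        counts[leading_digit(principal) - 1] += 1
        principal *= 1 + rate
    return counts

counts = compound_leading_digit_counts()
for digit, n in enumerate(counts, start=1):
    print(f"{digit}: {n} days ({n / sum(counts):.1%})")
```

With the defaults, the principal has leading digit 1 on 70 of the 232 days and leading digit 2 on 41 of them, in line with the doubling and tripling periods above.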
We find the χ² test statistic for the series of principals according to the following scheme:
In Google Sheets, as in Excel, the chi-squared distribution function takes as inputs the test statistic just calculated and the number of degrees of freedom, which is one less than the number of leading digits times one less than the number of columns being compared: (9 − 1) × (2 − 1) = 8.
This is an intrinsically one-sided test. The small value of the χ² test statistic here, 0.06, is close to zero and tells us that the data’s proportions are close to those of Benford’s law, which is the null hypothesis.
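The statistic and its p-value can be sketched in Python using only the standard library. This is not a reproduction of the spreadsheet’s formulas: the function names are chosen here for illustration, and the p-value uses the closed form of the chi-squared right tail for 8 degrees of freedom instead of a spreadsheet function. The example input is the 1% daily compounding series, assuming 232 values starting from the initial £1.

```python
import math

# Benford probabilities for leading digits 1..9.
BENFORD = [math.log10(1 + 1 / d) for d in range(1, 10)]

def chi_square_statistic(observed):
    """Pearson statistic: sum over the 9 digits of (observed - expected)^2 / expected."""
    total = sum(observed)
    expected = [total * p for p in BENFORD]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def chi_square_p_value_df8(statistic):
    """Right-tail probability of a chi-squared variable with 8 degrees of freedom.

    For an even number of degrees of freedom 2k the tail equals
    exp(-x/2) * sum_{i<k} (x/2)^i / i!; here k = 4.
    """
    half = statistic / 2
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(4))

# Leading-digit counts of the principal; every value lies in [1, 10),
# so int(p) is its leading digit.
principals = [1.01 ** n for n in range(232)]
observed = [sum(1 for p in principals if int(p) == d) for d in range(1, 10)]

statistic = chi_square_statistic(observed)
p_value = chi_square_p_value_df8(statistic)
print(f"chi-squared = {statistic:.3f}, p-value = {p_value:.4f}")
```

A p-value near 1 means the observed counts give no reason to reject the null hypothesis of Benford compliance.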
From number theory, now consider the leading digits of the first 232 terms of the Fibonacci sequence, which show Benford compliance:
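The Fibonacci counts are easy to regenerate. A minimal sketch, assuming the sequence starts 1, 1, 2, 3, ... (the function name is illustrative):

```python
def fibonacci_leading_digit_counts(n_terms=232):
    """Count leading digits of the first `n_terms` Fibonacci numbers (1, 1, 2, 3, ...)."""
    counts = [0] * 9
    a, b = 1, 1
    for _ in range(n_terms):
        # Python integers are arbitrary precision, so str(a)[0] is exact.
        counts[int(str(a)[0]) - 1] += 1
        a, b = b, a + b
    return counts

fib_counts = fibonacci_leading_digit_counts()
for digit, n in enumerate(fib_counts, start=1):
    print(f"{digit}: {n} ({n / sum(fib_counts):.1%})")
```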
Finally, before looking at real data sets, consider the leading digits of the 81 products in the 9×9 multiplication table, which deliver a Benford-compliant series of numbers:
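The multiplication-table counts can be tallied directly. A small sketch (the function name is illustrative):

```python
def multiplication_table_leading_digit_counts(size=9):
    """Count leading digits of all products i * j for 1 <= i, j <= size."""
    counts = [0] * 9
    for i in range(1, size + 1):
        for j in range(1, size + 1):
            counts[int(str(i * j)[0]) - 1] += 1
    return counts

table_counts = multiplication_table_leading_digit_counts()
print(table_counts)
```

For the 9×9 table the counts fall from 18 products with leading digit 1 down to 3 products with leading digit 9.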
If you download the data set of US jobs as per the paper by S. Lanham, you will get a list of job occupations per state. If you then create a pivot table over the jobs to merge the states into a US-wide picture, you can see that the 808 occupations defined by the BLS are indeed filled according to Benford’s law.
For UK male deaths from COVID, cut by occupation according to ONS data, we also see a high degree of Benford compliance:
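Outside a spreadsheet, the same pivot step amounts to summing employment per occupation across states before taking leading digits. The sketch below assumes rows of the form (state, occupation, employment); that layout, the function names and the placeholder rows are all invented for illustration and are not BLS figures.

```python
from collections import defaultdict

# Placeholder rows: (state, occupation, employment). Invented values, NOT BLS data.
rows = [
    ("S1", "occupation A", 1200),
    ("S2", "occupation A", 310),
    ("S1", "occupation B", 95),
    ("S2", "occupation B", 18),
]

def pivot_by_occupation(rows):
    """Sum employment over all states for each occupation."""
    totals = defaultdict(int)
    for _state, occupation, employment in rows:
        totals[occupation] += employment
    return dict(totals)

def leading_digit_counts(values):
    """Count leading digits of the positive integers in `values`."""
    counts = [0] * 9
    for value in values:
        if value > 0:
            counts[int(str(value)[0]) - 1] += 1
    return counts

totals = pivot_by_occupation(rows)
digit_counts = leading_digit_counts(totals.values())
print(totals, digit_counts)
```

Grouping on a coarser or finer key than occupation changes which totals feed the leading-digit count.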
Interestingly, we do not see compliance for UK COVID deaths per region (pivoted and collated along coarser regions than in the original ONS spreadsheet). In the chart, error bars of 40% have been added to the data in order to capture a Benford profile.
This suggests that there is a coarser or finer way to regionalise the data so that Benford compliance is achieved. As such, one could argue that the regional cuts provided are not of the right granularity to deliver a meaningful comparison between regions.
Indeed, quite generally, if a data set spans at least three orders of magnitude between its lowest and highest values (as these COVID death statistics do), then non-compliance with Benford’s law could be interpreted as a first indication of some misdoing.
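Whether a series meets this spread condition is a one-line check. A sketch, assuming the span is measured as log10 of the largest positive value over the smallest positive value (the function name is illustrative):

```python
import math

def orders_of_magnitude(values):
    """Return log10(max / min) over the strictly positive entries of `values`."""
    positive = [v for v in values if v > 0]
    return math.log10(max(positive) / min(positive))

# A series running from 2 to 2000 spans three orders of magnitude.
span = orders_of_magnitude([2, 50, 2000])
print(f"{span:.2f} orders of magnitude")
```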
In light of this, as a further example, following the suggestion in “The Maths Behind” by Colin Beveridge, consider the surface areas of Michigan’s lakes. Downloading the list of more than 700 lakes with their surface areas from Wikipedia, we do not see particularly good agreement. I am open to suggestions as to why.
A nice source of data for the classroom would be the World Bank, as per the Journal of Accountancy article, whose spreadsheet analysis I have appended to mine here.
References:
S. Lanham, “Analyzing Big Data with Benford’s Law: A Lesson for the Classroom”, American Journal of Business Education, Second Quarter 2019, https://clutejournals.com/index.php/AJBE/article/view/10285/10338
Colin Beveridge, “The Maths Behind”, https://www.hachette.co.uk/titles/colin-beveridge/the-maths-behind/9781844038985/
Daniel P. Pike, “Testing for the Benford Property”, https://evoq-eval.siam.org/Portals/0/Publications/SIURO/Vol1_Issue1/Testing_for_the_Benford_Property.pdf?ver=2018-03-30-130233-050
J. Carlton Collins, Journal of Accountancy, https://www.journalofaccountancy.com/issues/2017/apr/excel-and-benfords-law-to-detect-fraud.html
The Google Sheet used in the article can be found here.