Chi-Square Compliance Testing of Benford’s Law in Google Sheets

Does UK ONS COVID data pass the Benford Law sniff test?

The largest lakes of Michigan by surface area follow Benford’s law (only within 30% error bands)

In what follows we sketch the rationale behind Benford’s law, which describes how often each leading place value (first digit) occurs in a data set. We do this through the example of the leading place value of a series formed by incrementing an initial principal of £1 by £1 each day. We then look at a few examples from number theory that nicely encode the Benford prevalence relationship, and follow up with a couple of UK and US data sets that, with some pivot table manipulation in Google Sheets, mostly show compliance.

Benford’s law is well known as a check on the integrity of a data set: whether the data items so categorised have arisen naturally or been (not so) craftily curated. We explore Benford’s law by reviewing some well-known examples from finance and number theory, then apply it to the UK government’s ONS COVID data, US job types and Michigan lake sizes, and see how the granularity of the categorisation of the data appears to affect its compliance.

Benford’s prevalence of leading numbers law is summarised in the following table:

The log law difference relationship underpinning Benford’s law
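The nine proportions in the table follow from the standard Benford formula P(d) = log10(1 + 1/d). A minimal sketch that regenerates them (the name `benford` is my own choice, not something from the sheet):

```python
import math

# Benford's law: probability that the leading digit equals d, for d = 1..9.
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

for d, p in benford.items():
    print(f"{d}: {p:.1%}")
```

The nine values sum to 1 because the logs telescope to log10(10).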

Looking at the (base 10) logs of the numbers from 1 to 10, we can see that the difference of 0.3 between log 2 and log 1 is the same as the difference between log 4 and log 2, or between log 8 and log 4. So Benford’s law says that if we look at the first digits of a set of numbers there should be as many 1s as there are 4s, 5s, 6s and 7s put together. As Colin Beveridge puts it, “if your data is spread widely enough (and some other conditions apply), the lead digit of log(x) is (more or less) uniformly distributed”.
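Written out, the log-difference argument is a telescoping sum. The following is a sketch in standard notation, with P(d) denoting the Benford probability of leading digit d:

```latex
\begin{align*}
P(d) &= \log_{10}(d+1) - \log_{10} d = \log_{10}\!\left(1 + \tfrac{1}{d}\right), \\
\log_{10} 2 - \log_{10} 1 &= \log_{10} 4 - \log_{10} 2 = \log_{10} 8 - \log_{10} 4 \approx 0.301, \\
P(4) + P(5) + P(6) + P(7) &= \log_{10} 8 - \log_{10} 4 = \log_{10} 2 = P(1).
\end{align*}
```

By the same argument P(2) + P(3) = log 4 − log 2 also equals P(1).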

Those versed in the delights of compounded returns in finance will recognise the Benford probability distribution 1: 30%, 2: 18%, 3: 12%, 4: 10%, 5: 8%, … as the effective compounded returns, R, that arise from forming the log returns, R = log(1 + r), from the simple returns, r, that deliver an increment of £1 each day on a principal of £1.

£1 increments to the stock price each day give a compounded growth rate, R, that follows Benford’s law.

The continuously compounded (log) return, R, for a stock rising incrementally by £1 each day follows Benford’s leading place value probability distribution:

Stock price rising by 1 unit each day delivers effective compound growth rates, R according to Benford’s law
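To make the link concrete, here is a small sketch that builds the £1-a-day price series and forms the base-10 log return of each step. The names `prices`, `simple_returns` and `log_returns` are mine, and I assume the price starts at 1 and stops at 10:

```python
import math

# Price on day n is simply n pounds: 1, 2, 3, ..., 10 (assumed start at 1).
prices = list(range(1, 11))

# Simple return r and base-10 log return R = log10(1 + r) for each daily step.
simple_returns = [(b - a) / a for a, b in zip(prices, prices[1:])]
log_returns = [math.log10(1 + r) for r in simple_returns]

for price, r, R in zip(prices, simple_returns, log_returns):
    print(f"from price {price}: r = {r:.1%}, R = {R:.1%}")
```

The step from price d to d + 1 has r = 1/d, so its log return is log10(1 + 1/d), which is exactly the Benford proportion for leading digit d.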

Indeed, if the simple interest rate is a mere r = 1% per day, then the doubling, tripling, … of the principal takes longer, as follows:

simple interest rate of r=1% on £1 principal delivers compounded growth

We note that the doubling period at 1% is 70 days. Tripling your money takes a further 41 days. With 1% interest compounded daily, your principal is over £10 in 232 days. As a percentage of these 232 days, the doubling portion takes Benford’s 30% and the tripling portion his 17.6%. Running a Benford test on the accumulating principal we see the following (in the image below all the leading numbers shown are 1):

A count of frequency of leading place value for 1% compounded growth
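The same count can be sketched outside the spreadsheet. I assume here that day 0 is the initial £1 and that the series stops just before the principal passes £10, which gives 232 terms. The helper `leading_digit` is a hypothetical name of my own:

```python
from collections import Counter

def leading_digit(x: float) -> int:
    """Return the first significant digit of a positive number."""
    while x >= 10:
        x /= 10
    while x < 1:
        x *= 10
    return int(x)

# Principal of 1 growing at 1% per day; day 0 is the initial pound (an assumption).
N = 232
principal = [1.01 ** day for day in range(N)]
counts = Counter(leading_digit(v) for v in principal)

for d in range(1, 10):
    print(f"{d}: {counts[d]:3d}  ({counts[d] / N:.1%})")
```

Under these assumptions the digit 1 leads on 70 of the 232 days and the digit 2 on 41, matching the doubling and tripling periods quoted above.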

We find the χ² test statistic for the series of principals according to the following scheme:

The sum of the squared differences between the observed percentage, p, and the Benford percentage, b, each taken as a fraction of the expected (Benford) percentage b, times the number of terms in the series, N = 232, gives us our test statistic χ²
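For reference, the standard Pearson form of the statistic, which scales each squared difference by the expected proportion, reads as follows. The subscripted symbols are my notation for the observed proportion p and Benford proportion b of leading digit d:

```latex
\chi^2 \;=\; N \sum_{d=1}^{9} \frac{(p_d - b_d)^2}{b_d}
       \;=\; \sum_{d=1}^{9} \frac{(O_d - E_d)^2}{E_d},
\qquad b_d = \log_{10}\!\left(1 + \tfrac{1}{d}\right),\quad O_d = N p_d,\quad E_d = N b_d,\quad N = 232 .
```

The two forms are identical; the second is the one usually quoted in terms of observed and expected counts.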

In Google Sheets, as in Excel, the chi-square distribution function takes as inputs the test statistic just calculated and the degrees of freedom in the system (one less than the number of place values, times one less than the 2 columns being compared: 8 × 1 = 8):

Google Sheets implementation of the chi-square test on the compounded 1% series versus Benford

This is an intrinsically one-sided test. The small value of the χ² test statistic here, 0.06, close to zero, tells us that the data’s proportions are close to Benford’s law, which is the null hypothesis.
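As a cross-check outside the spreadsheet, the sketch below recomputes a Pearson χ² for the 1% series and its right-tail probability with 8 degrees of freedom. It rests on assumptions of mine: day 0 counts as the first term, and the tail probability uses the closed form that holds for an even number of degrees of freedom. It should be comparable to a right-tail function such as `CHISQ.DIST.RT`, though I have not confirmed which function the sheet calls. Because of such conventions the statistic need not reproduce the 0.06 above to the last digit.

```python
import math
from collections import Counter

N = 232
# Leading digits of 1.01**day for day = 0..231 (all values lie in [1, 10)).
observed = Counter(int(1.01 ** day) for day in range(N))
expected = {d: N * math.log10(1 + 1 / d) for d in range(1, 10)}

# Pearson chi-square statistic against the Benford expectation.
chi2 = sum((observed[d] - expected[d]) ** 2 / expected[d] for d in range(1, 10))

def chi2_right_tail(x: float, df: int) -> float:
    """P(X >= x) for a chi-square variable with an even number of degrees of freedom."""
    if df % 2:
        raise ValueError("the closed form used here needs an even df")
    half = x / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))

df = 9 - 1  # nine leading digits, minus one because the proportions sum to 1
p_value = chi2_right_tail(chi2, df)
print(f"chi2 = {chi2:.3f}, df = {df}, p = {p_value:.6f}")
```

A p-value close to 1 means there is no evidence against the null hypothesis of Benford compliance.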

Turning now to number theory, consider the leading place values of the first 232 terms of the Fibonacci series, which show Benford compliance:

Primary place value of the first 232 terms of Fibonacci series follows Benford’s law
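A short sketch of the Fibonacci check, assuming the series starts 1, 1, 2, 3, … (the sheet may index it differently). The function name `fibonacci` is my own:

```python
import math
from collections import Counter

def fibonacci(n):
    """First n Fibonacci numbers, starting 1, 1, 2, 3, ..."""
    terms, a, b = [], 1, 1
    for _ in range(n):
        terms.append(a)
        a, b = b, a + b
    return terms

terms = fibonacci(232)
# Python integers are exact, so the first character of str(t) is the true leading digit.
counts = Counter(int(str(t)[0]) for t in terms)

for d in range(1, 10):
    share = counts[d] / len(terms)
    print(f"{d}: observed {share:.1%}, Benford {math.log10(1 + 1 / d):.1%}")
```

Using exact integers avoids the loss of precision a spreadsheet suffers once the terms exceed about 15 significant digits.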

Finally, before looking at real data sets, consider the leading place values of the entries of the 9×9 multiplication table, which form a Benford-compliant series of numbers:

9×9 multiplication delivers numbers that are Benford compliant to a good degree
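The multiplication-table count is small enough to enumerate directly. A minimal sketch with variable names of my choosing:

```python
from collections import Counter

# All 81 products i * j for i, j in 1..9, and the leading digit of each.
products = [i * j for i in range(1, 10) for j in range(1, 10)]
counts = Counter(int(str(p)[0]) for p in products)

for d in range(1, 10):
    print(f"{d}: {counts[d]:2d} of {len(products)}  ({counts[d] / len(products):.1%})")
```

The digit 1 leads 18 of the 81 products (about 22%) and the digit 9 only 3, so the profile falls away as Benford predicts, if less steeply at the start.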

If you download the data set of US jobs used in the paper by S. Lanham, you will get a per-state list of job occupations. If you then create a pivot table over the jobs, merging the states to give a US-wide picture, you can see that the 808 US prescribed jobs according to the BLS are indeed filled according to Benford’s law.

Number of US employees in prescribed job follows Benford’s law within a 10% uncertainty
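The pivot step amounts to summing employment over states for each occupation and then counting leading digits of the totals. Here is a rough sketch of that idea. The function names and the `(state, occupation, employees)` row layout are my assumptions, and the toy rows are made up purely to show the shape of the input; they are not BLS figures:

```python
from collections import Counter, defaultdict

def pivot_sum(rows):
    """Sum employment across states for each occupation (a pivot-table 'SUM' view)."""
    totals = defaultdict(int)
    for _state, occupation, employees in rows:
        totals[occupation] += employees
    return dict(totals)

def leading_digit_counts(values):
    """Count the first digits of the positive integers in values."""
    return Counter(int(str(v)[0]) for v in values if v > 0)

# Made-up toy rows purely to illustrate the layout; these are NOT BLS figures.
toy_rows = [
    ("State A", "Occupation X", 120),
    ("State B", "Occupation X", 95),
    ("State A", "Occupation Y", 40),
    ("State B", "Occupation Y", 7),
]
national = pivot_sum(toy_rows)
print(national)
print(leading_digit_counts(national.values()))
```

With the real download, the rows would come from the per-state file and the digit counts would then feed the same χ² comparison as before.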

For UK male deaths from COVID, cut by occupation according to ONS data, we also see a high degree of Benford compliance:

Numbers of deaths and Death Rates per 100,000 (95% Confidence Intervals) involving COVID-19 for 4 digit SOC codes: men aged 20-64, England and Wales, deaths registered between 9 March and 28 December 2020

Interestingly, we do not see compliance for COVID deaths per region in the UK (pivoted and collated along coarser regions than in the original ONS spreadsheet): here 40% error bars have been added to the data set in order to capture a Benford profile.

Number of deaths due to COVID-19 in Middle-layer Super Output Areas, England and Wales, deaths registered between 1 March and 31 December 2020

This suggests that there is an optimal, coarser or finer, way to regionalise the data such that Benford compliance will be achieved. As such, one could argue that the regional cuts provided are not of the correct refinement to deliver a meaningful comparison between each other.

Indeed, quite generally, if one has a data set in which the lowest and highest values in the series are at least 3 orders of magnitude apart (as in these COVID death statistics), then non-compliance with Benford could be interpreted as a first indication of some misdoing.
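The spread condition is easy to test before running a Benford comparison. A small sketch follows; the function names and the default threshold of 3 are my own reading of the rule of thumb above:

```python
import math

def orders_of_magnitude(values):
    """Base-10 orders of magnitude between the smallest and largest positive value."""
    positive = [v for v in values if v > 0]
    if not positive:
        raise ValueError("need at least one positive value")
    return math.log10(max(positive) / min(positive))

def wide_enough_for_benford(values, min_span=3.0):
    """True when the data spans at least min_span orders of magnitude."""
    return orders_of_magnitude(values) >= min_span
```

For example, `wide_enough_for_benford([2, 50, 4000])` is true because 4000 / 2 = 2000 spans about 3.3 orders of magnitude, whereas a series confined to two-digit numbers would fail the check.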

In light of this, as a further example, following the suggestion in “The Maths Behind” by Colin Beveridge, consider the surface area of Michigan’s lakes. Downloading the list of more than 700 lakes with their surface areas from Wikipedia, we do not see particularly good agreement. I am open to suggestions.

The lakes of Michigan and the poor Benford compliance

A nice source of data for the classroom would be the World Bank, as used in the Journal of Accountancy article, whose spreadsheet analysis I have appended to mine here.

References:

S. Lanham, “Analyzing Big Data with Benford’s Law: A Lesson for the Classroom”, American Journal of Business Education, Second Quarter 2019, https://clutejournals.com/index.php/AJBE/article/view/10285/10338

Colin Beveridge, “The Maths Behind”, https://www.hachette.co.uk/titles/colin-beveridge/the-maths-behind/9781844038985/

Daniel P. Pike, “Testing for the Benford Property”, https://evoq-eval.siam.org/Portals/0/Publications/SIURO/Vol1_Issue1/Testing_for_the_Benford_Property.pdf?ver=2018-03-30-130233-050

J. Carlton Collins, Journal of Accountancy, https://www.journalofaccountancy.com/issues/2017/apr/excel-and-benfords-law-to-detect-fraud.html

Magic of Log Returns: Concept – Part 1

The Google Sheet used in the article can be found here.
