Chunlu Xiao
STAT 2501 Project
Benford's and Zipf's Law
Both Benford’s Law and Zipf’s Law are empirical results drawn from large amounts of real-life data; they are related to each other and can be applied in everyday life. This paper introduces and explains the two laws in a simple way.
Benford’s Law
Benford's Law, also called the First-Digit Law, refers to the frequency distribution of digits in many (but not all) real-life sources of data. In this distribution, 1 occurs as the leading digit about 30% of the time, while larger digits occur in that position less frequently: 9 as the first digit less than 5% of the time. Benford's Law also concerns the expected distribution for digits beyond the first, which approach a uniform distribution.
For $d = 1, \ldots, 9$, the proportion of values in a data set whose first digit is $d$ is approximately $\log_{10}(1 + 1/d)$. Thus, for instance, such a data set should have a first digit of 1 about 30% of the time, but a first digit of 9 only about 5% of the time.
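The percentages quoted above can be reproduced from the standard first-digit formula, in which digit d is expected with probability log10(1 + 1/d). The following is a minimal sketch; the names benford_probability and benford_table are chosen here for illustration only.

```python
import math

def benford_probability(d: int) -> float:
    """Expected share of values whose leading digit is d (1 through 9)."""
    if not 1 <= d <= 9:
        raise ValueError("leading digit must be between 1 and 9")
    return math.log10(1 + 1 / d)

# Expected first-digit distribution for digits 1..9.
benford_table = {d: benford_probability(d) for d in range(1, 10)}

for d, p in benford_table.items():
    print(f"{d}: {p:.1%}")
```

The nine probabilities sum to 1, with digit 1 close to 30% and digit 9 a little under 5%.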
The American astronomer Simon Newcomb discovered the law in 1881 after noticing that the first pages of books of logarithms were soiled much more than the remaining pages. In 1938, Frank Benford arrived at the same formula after a comprehensive investigation of listings of data covering a variety of natural phenomena. The law applies to budget, income tax, and population figures, as well as to the street addresses of people listed in the book American Men of Science. Given how universal the law is, it is quite astonishing that a more general framework exists, Zipf's Law, which in turn falls under the more general rubric of scaling phenomena.
This result has been found to apply to a wide variety of data sets, including electricity bills, street addresses, stock prices, population numbers, death rates, lengths of rivers, physical and mathematical constants, and processes described by power laws (which are very common in nature). It tends to be most accurate when values are distributed across multiple orders of magnitude.
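As a small illustration of data spread across many orders of magnitude, the sketch below tallies the leading digits of the first 1,000 powers of 2 and compares them with the expected proportions log10(1 + 1/d). The choice of powers of 2 is an assumption made here for demonstration; it is not one of the data sets listed above.

```python
import math
from collections import Counter

def leading_digit(n: int) -> int:
    """First decimal digit of a positive integer."""
    return int(str(n)[0])

# Illustrative data: 2^1 through 2^1000, spanning about 300 orders of magnitude.
powers = [2 ** n for n in range(1, 1001)]
counts = Counter(leading_digit(p) for p in powers)

observed = {d: counts[d] / len(powers) for d in range(1, 10)}
expected = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

for d in range(1, 10):
    print(f"{d}: observed {observed[d]:.3f}, expected {expected[d]:.3f}")
```

The observed and expected columns should agree closely for every digit, which is the behavior the law describes.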
Zipf’s Law
George Kingsley Zipf, a Harvard linguist, discovered that in the English language words like "and," "the," "to," and "of" occur often, while words like "undeniable" are rare. This law applies to words in human or computer languages, operating system calls, colors in images, etc., and is the basis of many (if not all) compression approaches. It means that the probability of occurrence of words or other items starts high and tapers off: a few occur very often, while many others occur rarely.
In mathematical form, Zipf’s Law can be stated as follows:
The $n$-th largest value in the data should obey an approximate power law, i.e. it should be approximately $C n^{-\alpha}$ for the first few $n = 1, 2, 3, \ldots$ and some parameters $C, \alpha > 0$. In many cases, $\alpha$ is close to 1.
Zipf's law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc. For example, in the Brown Corpus of American English text, the word "the" is the most frequently occurring word, and by itself accounts for nearly 7% of all word occurrences (69,971 out of slightly over 1 million). True to Zipf's Law, the second-place word "of" accounts for slightly over 3.5% of words (36,411 occurrences), followed by "and" (28,852). Only 135 vocabulary items are needed to account for half the Brown Corpus.
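The Brown Corpus figures quoted above can be checked against the idealized form of the law, in which the count at rank n is roughly C / n^alpha. The sketch below assumes alpha = 1 and takes C to be the count of the top-ranked word; both are simplifying assumptions made here, and zipf_prediction is a hypothetical helper name.

```python
# Word counts quoted in the text for the Brown Corpus, in rank order.
brown_counts = [("the", 69971), ("of", 36411), ("and", 28852)]

def zipf_prediction(top_count: float, rank: int, alpha: float = 1.0) -> float:
    """Idealized Zipf count at a given rank: top_count / rank**alpha."""
    return top_count / rank ** alpha

top_count = brown_counts[0][1]

for rank, (word, count) in enumerate(brown_counts, start=1):
    predicted = zipf_prediction(top_count, rank)
    ratio = count / predicted
    print(f"rank {rank} {word!r}: actual {count}, predicted {predicted:.0f}, ratio {ratio:.2f}")
```

The second-ranked word lands close to the prediction, while the third comes out somewhat above it, which is in keeping with the law being approximate rather than exact.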
Both Benford’s Law and Zipf’s Law are products of induction rather than deduction: they are conclusions drawn from real-life data and can be applied back to real life. Their uses are extensive; for example, Benford’s Law can be applied to stock data and Zipf’s Law to information retrieval.
I have learned a lot from this self-study. Benford’s Law and Zipf’s Law are conclusions drawn from everyday life. For example, Frank Benford found that 1 appears as the leading digit far more often than any other digit, and from this he formulated the First-Digit Law. I think this law has extensive uses. For example, it can be used to test whether a series of data has been tampered with: the law only holds for data that has not been altered, so data that has been manipulated will tend not to follow it. This can be applied to accounting audits and verification, saving auditors a lot of time.
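One possible way to carry out such a tampering check is sketched below. It compares the observed first digits with the Benford proportions using a chi-square goodness-of-fit statistic. The chi-square test is a technique chosen here for illustration and is not taken from the sources above; both data sets are made up, and the cutoff 15.507 is the usual 5% critical value for 8 degrees of freedom.

```python
import math
from collections import Counter

BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}
CRITICAL_5PCT_8DF = 15.507  # chi-square cutoff: 8 degrees of freedom, 5% level

def first_digit(x: float) -> int:
    """Leading nonzero digit of a nonzero number, read from scientific notation."""
    return int(f"{abs(x):.16e}"[0])

def benford_chi_square(values) -> float:
    """Chi-square distance between observed first digits and Benford's Law."""
    digits = [first_digit(v) for v in values if v != 0]
    n = len(digits)
    counts = Counter(digits)
    return sum((counts[d] - n * p) ** 2 / (n * p) for d, p in BENFORD.items())

# Hypothetical "natural" figures: steady 7% growth over several orders of magnitude.
natural = [100 * 1.07 ** k for k in range(200)]
# Hypothetical manipulated figures: every amount starts with the digit 9.
suspicious = [9000 + 5 * k for k in range(200)]

for name, data in [("natural", natural), ("suspicious", suspicious)]:
    stat = benford_chi_square(data)
    verdict = "consistent with Benford" if stat < CRITICAL_5PCT_8DF else "worth a closer look"
    print(f"{name}: chi-square = {stat:.1f} ({verdict})")
```

A large statistic only flags a data set for closer inspection; it does not prove tampering, since some honest data sets do not follow the law in the first place.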
As far as I am concerned, the big ideas behind Benford’s Law and Zipf’s Law are similar. From what I know, I think Benford’s Law is more useful than Zipf’s Law and its range of application is wider, but there is no doubt that Zipf’s Law is useful as well.
From my research, I found that these two laws are connected to the uniform distribution; for example, Benford's Law also concerns the expected distribution of digits beyond the first, which approaches a uniform distribution. I think that when an instructor lectures on the uniform distribution, these two laws could be mentioned and introduced.…...

