My Typing Stats (Featuring the Law of Large Numbers)
I've been using a keylogger to track every one of my keystrokes for the past month. Today, I decided to analyze that data.
In my analysis, I wanted to see how much my use of each letter deviates from the average. I assumed my usage would differ because, in addition to typing in English, I also type Persian on my English keyboard. In my keyboard setup, each English key represents the Persian letter that most closely resembles it phonetically. Most keys map to the Persian letter that makes the exact same sound, and some map to letters with similar sounds (e.g. X represents the letter with the KH sound).
This is what the graph looks like (click on image for full size):
As labeled in the graph, the blue and green bars show how frequently each letter appears in the English language and in my typing, respectively, and the orange line shows how much my usage of each letter deviates from its English frequency. The red line labeled NORMAL marks a deviation of 0% (i.e. where an orange dot would fall if I used its letter exactly as often as it appears in English).
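The deviation measure behind the orange line can be sketched in a few lines of Python. This is my own reconstruction, not the actual analysis script; the reference frequencies below are approximate values from standard English frequency tables, trimmed to a handful of letters for brevity.

```python
from collections import Counter

# Approximate English letter frequencies (percent), from standard tables.
# Only a few letters are listed here to keep the sketch short.
ENGLISH_FREQ = {"e": 12.7, "t": 9.1, "a": 8.2, "z": 0.07, "q": 0.10}

def letter_deviations(text, reference=ENGLISH_FREQ):
    """Percent deviation of each letter's frequency in `text` from `reference`.

    0 means the letter appears exactly as often as the reference predicts
    (the NORMAL line); positive means it is over-used, negative under-used.
    """
    letters = [c for c in text.lower() if c.isalpha()]
    counts = Counter(letters)
    total = len(letters)
    deviations = {}
    for letter, expected in reference.items():
        observed = 100 * counts[letter] / total if total else 0.0
        deviations[letter] = 100 * (observed - expected) / expected
    return deviations
```

Feeding the full keystroke log into `letter_deviations` would give one number per letter, which is what the orange dots plot.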
Looking at the graph, I noticed something interesting: Z, Q, X, and J had the highest deviation from normal usage (because of how the graph is set up, I first noticed only the signed deviation, but they have the highest absolute deviations too). Those four letters are also the four least used letters in English (and in the same order, too!).
What was interesting about this was that it seemed to follow the law of large numbers: the larger a sample we have, the less the observed outcome tends to deviate from what we expect.
To understand the law of large numbers, imagine a huge coin-tossing event with a thousand participants. The first person flips a coin once, the second person flips it twice, the third person three times, and so on. The law of large numbers says that the 1000th person (who flips the coin 1000 times) is more likely to land near the expected 50% heads, 50% tails outcome than, say, the person who flips 20 times. In short, the deviation from the expected outcome tends to be smallest for large samples and largest for small ones.
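The coin-tossing thought experiment above is easy to simulate. Here's a minimal sketch (the seed, trial count, and function names are my own choices, not anything from the original analysis):

```python
import random

random.seed(42)  # fixed seed so the demo is reproducible

def heads_deviation(n_flips):
    """Absolute deviation of the observed fraction of heads from the expected 0.5."""
    heads = sum(random.random() < 0.5 for _ in range(n_flips))
    return abs(heads / n_flips - 0.5)

def mean_deviation(n_flips, trials=300):
    """Average that deviation over many independent participants."""
    return sum(heads_deviation(n_flips) for _ in range(trials)) / trials

# A 20-flip participant typically lands much farther from 50/50
# than a 1000-flip participant.
print(mean_deviation(20), mean_deviation(1000))
```

Running this, the average deviation for 1000-flip participants comes out well below that of 20-flip participants, which is exactly the pattern the letter data should show if the law of large numbers is at work.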
I decided to make another graph to check whether my numbers actually did follow the law of large numbers. I made a scatterplot with the X-axis representing the frequency of each letter in my typing, and the Y-axis representing the absolute deviation of that letter's frequency in my typing from its frequency in English. This is how it turned out (click on image for full size):
As you can see, less-frequent letters, on average, deviated from the norm more than frequently-used letters -- just as the law of large numbers states.
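The check the scatterplot performs can be sketched like this. The (frequency, |deviation|) pairs below are hypothetical stand-ins for the real keylogger numbers, just to show the shape of the comparison:

```python
# Hypothetical per-letter data: frequency in English (%) and absolute
# percent deviation of my usage from it. Real values come from the log.
points = {
    "e": (12.7, 2.0), "t": (9.1, 3.5), "a": (8.2, 1.8),
    "j": (0.15, 60.0), "q": (0.10, 75.0), "z": (0.07, 90.0),
}

# Split letters into frequent (>= 1% of text) and rare (< 1%) bins,
# then compare the average absolute deviation in each bin.
frequent = [dev for freq, dev in points.values() if freq >= 1.0]
rare = [dev for freq, dev in points.values() if freq < 1.0]

mean_frequent = sum(frequent) / len(frequent)
mean_rare = sum(rare) / len(rare)
```

With the real data, the rare bin's mean deviation comes out higher, which is what the downward trend in the scatterplot shows.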
Note: The expected outcome shouldn't have been for each letter to appear at exactly its English frequency in my typing, since I write Persian too. However, Persian makes up only around 20% of my daily typing, so the effect shouldn't be too large (though it will be larger on the less-frequent letters!).