Data distribution

From Sustainability Methods
Revision as of 10:40, 11 September 2019 by Prabesh (talk | contribs)

(The author of this entry is Henrik von Wehrden.)

Understanding how data are distributed is the most basic and fundamental step in analysing any given data set. At the same time, data distribution encompasses some of the most complex concepts in statistics, and these concepts translate into many different steps of analysis. Consequently, without understanding the basics of data distribution, it is next to impossible to understand any statistics down the road. Data distribution can be seen as the foundation, and we shall often return to it when building further on statistics.

Types of distribution

The normal distribution

How wonderful! It is truly a miracle how almost everything that can be measured seems to follow the normal distribution. Overall, the normal distribution is not only the most abundantly occurring distribution, but also one of the earliest to be described. It rests on the premise that most of the data in any given dataset cluster around a mean value, and only small amounts of the data are found at the extremes.

Take height. Most people are close to average height, while only a few people are very tall and a few are very short. Many such natural phenomena follow the normal distribution. Just measure the weight of individual spaghetti strands with a very precise balance. The majority will cluster around a mean value, and only a few will be much heavier or much lighter. While it may seem like a magic trick, it is actually true that many phenomena that can be measured will follow the normal distribution, at least when your sample is large enough. Consequently, much of probabilistic statistics is built on the normal distribution.
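The spaghetti example can be tried out in a small simulation. The following is only a minimal sketch using the Python standard library: the mean of 1.0 g, the standard deviation of 0.05 g and the helper name `share_within` are assumptions made up for illustration, not measured values.

```python
import random
import statistics

def share_within(values, k):
    """Fraction of values lying within k standard deviations of the mean."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    inside = sum(1 for v in values if abs(v - mean) <= k * sd)
    return inside / len(values)

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical spaghetti weights in grams (assumed mean 1.0 g, sd 0.05 g)
weights = [rng.gauss(1.0, 0.05) for _ in range(10_000)]

print(f"within 1 sd of the mean: {share_within(weights, 1):.3f}")
print(f"within 2 sd of the mean: {share_within(weights, 2):.3f}")
```

For normally distributed data, roughly 68 % of the values fall within one standard deviation of the mean and roughly 95 % within two, so the two printed shares should land close to these figures.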

See Tests for normal distribution to learn how to check whether the data are normally distributed.

The Poisson distribution

Things that can be counted are often not normally distributed, but are instead right-skewed: most values pile up at the low end, and a long tail stretches towards high values. While this may seem curious, it actually makes a lot of sense. Take an example that coffee drinkers may like. How many people do you think drink one or two cups of coffee per day? Quite many, I guess. How many drink 3-4 cups? Fewer people, I would say. Now how many drink 10 cups? Only a few, I hope. A similar and maybe healthier example can be found in sports activities. How many people do 30 minutes of sport per day? Quite many, maybe. But how many do 5 hours? Only very few. In phenomena that can be counted, such as sports activities in minutes per day, most people will tend towards a lower number of minutes, and few towards a high number of minutes.

Now here comes the funny surprise. Apply a suitable transformation (the logarithm is a common choice) to data that follow a Poisson distribution, and the result will often approximate the normal distribution. Hence skewed data can often be transformed to match the normal distribution. While many people refrain from this, it may actually make sense in examples such as island biogeography. Developed by MacArthur & Wilson, this theory is a prominent example of how the log of the number of species and the log of island size are closely related. While this is one of the fundamentals of ecology, a statistician would have preferred the use of the Poisson distribution.
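To see the skew of count data in numbers, one can simulate the coffee example. This is a minimal sketch under stated assumptions: the average of 2 cups per day is invented for illustration, `poisson_sample` and `skewness` are hypothetical helper names, and the sampler uses Knuth's multiplication method because the Python standard library has no built-in Poisson generator.

```python
import math
import random
import statistics
from collections import Counter

def poisson_sample(rng, lam):
    """Draw one Poisson-distributed count using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k = 0
    product = rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

def skewness(values):
    """Sample skewness: positive values indicate a long tail to the right."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)

rng = random.Random(7)  # fixed seed so the sketch is reproducible

# Hypothetical cups of coffee per day for 10,000 people (assumed average: 2)
cups = [poisson_sample(rng, 2.0) for _ in range(10_000)]

# Simple text histogram: one '#' per 100 people
for n_cups, freq in sorted(Counter(cups).items()):
    print(f"{n_cups:2d} cups: {'#' * (freq // 100)}")

print(f"mean     = {statistics.fmean(cups):.2f}")
print(f"variance = {statistics.pvariance(cups):.2f}")
print(f"skewness = {skewness(cups):.2f}")
```

The histogram should show most people at low counts and a thin tail towards many cups. Two hallmarks of the Poisson distribution are visible in the summary: the mean and the variance are about equal, and the skewness is positive.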

The Pareto distribution

Did you know that most people wear 20 % of their clothes 80 % of the time? This observation can be described by the Pareto distribution.
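The 80/20 pattern can be checked against the Pareto distribution directly. The sketch below rests on a known property of the Pareto distribution: with shape parameter alpha greater than 1, the top fraction p of items accounts for a share of p^(1 - 1/alpha) of the total. The wardrobe data are simulated and purely hypothetical, and `top_share` and `sample_top_share` are illustrative helper names.

```python
import math
import random

def top_share(p, alpha):
    """Theoretical share of the total held by the top fraction p
    under a Pareto distribution with shape alpha > 1."""
    return p ** (1 - 1 / alpha)

# Shape parameter for which the top 20 % account for 80 % of the total
ALPHA_80_20 = math.log(5) / math.log(4)

def sample_top_share(values, p):
    """Share of the total contributed by the largest fraction p of the values."""
    ranked = sorted(values, reverse=True)
    n_top = max(1, round(p * len(ranked)))
    return sum(ranked[:n_top]) / sum(ranked)

rng = random.Random(3)  # fixed seed so the sketch is reproducible

# Hypothetical "times worn" for 10,000 garments
wear = [rng.paretovariate(ALPHA_80_20) for _ in range(10_000)]

print(f"theoretical share of top 20 %: {top_share(0.2, ALPHA_80_20):.2f}")
print(f"simulated share of top 20 %:   {sample_top_share(wear, 0.2):.2f}")
```

The theoretical share is 0.80 by construction. The simulated share fluctuates from sample to sample, because a distribution with such a heavy tail is dominated by a few extreme values.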