Difference between revisions of "Data formats"

From Sustainability Methods
(86 intermediate revisions by 7 users not shown)
Line 1: Line 1:
(The author of this entry is [[Wiki contributors|Henrik von Wehrden]].)
+
'''Note:''' The German version of this entry can be found here: [[Data formats (German)]].
  
='''Data formats in statistics'''=
+
'''In short''': This entry introduces different data formats.
The format of your data influences everything else you do further down the road. To paraphrase a proverb, data is in a format, and the format is the data. Therefore, it is essential to know which [https://www.youtube.com/watch?v=hZxnzfnt5v8 different data formats] exist, and how these may be beneficial, and where you may encounter pitfalls.
 
  
=='''An example of different data formats'''==
+
==Data formats in statistics==
 +
The format of your data influences everything else you do further down the road. To paraphrase a proverb, data is in a format, and the format is the data. Therefore, it is essential to know which [https://www.youtube.com/watch?v=hZxnzfnt5v8 different data formats] exist, and how these may be beneficial, and where you may encounter pitfalls. For more information on different means of measurement, please refer to the [[To Rule And To Measure|'To Rule And To Measure' entry.]]
 +
 
 +
The most important difference is between quantitative data and qualitative data. Quantitative data can consist of continuous, discrete or interval data, while qualitative data can be factorial -meaning in truly different categories- nominal or ordinal, with the latter two providing a link to quantitative data. However, within different areas of science, the nomenclature for data formats widely differs, and to be honest, it is a mess. Here, we try to be consistent, yet please be aware that these names are not consistent across science.
 +
 
 +
 
 +
====Examples of different data formats====
 
[[File:DataFormatsDiet.png|thumb|right|Tracking your diet is just one of many examples how you can approach different data formats.]]
 
Imagine you want to track your diet. Many people do this today, there are diet books and advice everywhere, much information has become available. Now you want to start and become more familiar with what you eat. How would you start? Counting calories? Differentiating between carbs, fat and greens? Maybe you just count every time you ate a Pizza? Or ice cream? Or too much? There are many ways to measure your diet. And these measurement can be in different data formats.
+
Imagine you want to track your diet. Many people do this today, there are diet books and advice everywhere, much information has become available. Now you want to start and become more familiar with what you eat. How would you start? Counting calories? Differentiating between carbs, fats and greens? Maybe you just count every time you ate a pizza? Or ice cream? Or too much? There are many ways to measure your diet. And these measurements can be in different data formats.
 +
 
 +
Most data formats can be transformed into other data formats, which many people find confusing. Nominal data, for instance, can be counted: you may count the quite different cups of coffee you drink every day, such as a flat white, an Americano, and two espressi. The individual cups then add up to a total number of cups of coffee, which represents discrete data. A different example is a person's height, which can be represented as continuous data in meters. While height can be expressed in numbers, it can also be expressed in categories such as "short" and "tall".
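The two transformations described above can be sketched in R. This is only an illustration: the coffee log, the heights and the cut-off of 1.75 m are made-up assumptions, not values from this entry.

<syntaxhighlight lang="R" line>
# Hypothetical one-day coffee log: nominal data (type of coffee)
coffee <- c("Flat white", "Americano", "Espresso", "Espresso")

# Counting the nominal entries yields discrete data
cups_per_type <- table(coffee)
total_cups <- length(coffee)

# Hypothetical heights in meters: continuous data
height_m <- c(1.58, 1.72, 1.81, 1.94)

# Cutting the continuous scale at an arbitrary threshold (1.75 m, an assumption)
# turns the numbers into the categories "short" and "tall"
height_class <- cut(height_m, breaks = c(-Inf, 1.75, Inf), labels = c("short", "tall"))
</syntaxhighlight>

Note that the transformation from continuous to categorical loses information: once the heights are stored as "short" and "tall", the original values cannot be recovered.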
 +
 
 +
=== Quantitative data ===
 +
'''Quantitative data (numeric) is data that is expressed in numbers which can be used in a numerical sense,''' i.e. the numbers can be used to do calculations. There are three types of quantitative data: Continuous, discrete and interval data.
 +
 
 +
====Continuous data====
 +
'''Continuous data is numerical data that cannot be counted because it exists on a finite or infinite number line.''' We are all familiar with continuous numbers. Much of our society is ruled by these numbers, and thus much of the data analysed in statistics is represented by continuous numbers. Since much of modern measurement is automated within a given predefined system, we often do not have to worry too much about what the data looks like. Take for instance weight or size. Within Central Europe, these are clearly measured in grams or kilograms, and in centimeters or meters, respectively. However, if you move to the US, it becomes a whole different story, because of the [https://www.factmonster.com/math-science/weights-measures/metric-weights-and-measures metric system], or the lack thereof. Suddenly you are some feet tall, and you may weigh some pounds (or some "stones", if you move to the UK instead). Many [https://www.factmonster.com/math-science/weights-measures/the-international-system-metric diverse measurement systems] exist, and one has to be aware of which system was used for a given measurement. Hence these systems are constructs, and these constructs build on continuous numbers. Continuous numbers are widely used to express data, but we have to be aware that this then still represents normative information.
  
=='''Continuous data'''==
+
Continuous data has a true zero. A true zero is defined as a total absence of something that can be represented in numbers. Although a weight of 0 kg or a length of 0 m is abstract, the values represent the absence of weight and length, respectively.  
[[File:Continuous Data.jpg|thumb|left|There are sometimes more than one system of measuring data, as you can see here. Nevertheless, both of them are continuous.]]
 
We are all familiar with continuous numbers. Much of our society is ruled by these numbers, and thus much of the data analysed in statistics is represented by continuous numbers. Since much of modern measurement is automated within a given predefined system, we often do not have to worry too much about what the data looks like. Take for instance weight or size. Within [https://www.factmonster.com/math-science/weights-measures/metric-weights-and-measures Central Europe], these are clearly measured in grams or kilograms, and in centimeters or meters, respectively. However, if you move to the US, it becomes a whole different story, because of the metric system, or the lack thereof.
 
Suddenly you are some feet tall, and you may weigh some pounds (or some "stones", if you move to the UK instead). Many [https://www.factmonster.com/math-science/weights-measures/the-international-system-metric diverse measurement systems] exist, and one has to be aware of which system was used for a given measurement. Take temperature, which I would measure in Celsius. However, my friends from the [https://www.factmonster.com/math-science/weights-measures/us-weights-and-measures US] are stuck with Fahrenheit, which to me is entirely counter-intuitive. I think the fact that water freezes at 0°C and boils at 100°C makes Celsius almost divine; however, looking at the lowest possible temperature (-273 °C) already showcases that Celsius may not be so divine after all. Hence these systems are constructs, and these constructs build on continuous numbers. Another prominent construct expressed in continuous numbers is the [https://www.iqmindware.com/wiki/what-does-my-iq-score-mean Intelligence Quotient]. Being [https://www.youtube.com/watch?v=7p2a9B35Xn0 highly questionable] from a research standpoint, it nevertheless serves as a basis to identify the elitist Mensa members. With an IQ of 100, you are considered to be average. Yet, already the interpretation of what higher and lower numbers mean is widely disagreed upon. This showcases that continuous numbers are widely used to express data, but we have to be aware that this then still represents normative information.
 
  
==='''Examples'''===
+
'''Examples of continuous data:'''<br/>
 
- the number Pi: 3.14159265359...
 
<br>
 
- typical weigh of a naked mole-rat: 30 grams
+
- typical weight of a naked mole-rat: 30 grams
 
<br>
 
 
- the height of the Empire State Building: 443.2 m
 
<br>
 
- the melting temperature of dark chocolate: 45-50°C
 
  
=='''Ordinal data'''==
+
==== Discrete data ====
[[File:Likert scale.jpg|thumb|right|Likert Scale]]
+
'''Discrete data is numeric data that can be counted because it only exists as natural numbers''' (1, 2, 3, 4...). An example is the number of students in a lecture, where fractions are not helpful. Of course, you can think of a halved apple, but usually, if we count apples or birds or students, we consider them as complete units and stick to natural numbers. Discrete data is often also referred to as 'abundance' or 'count' data, and within the R language it is stored as "integer".
[[File: Ordinal Data.jpg|thumb|right|Even if one can disagree about the objectivity and purpose of marks, it is a vivid example for ordinal data.]]
 
Remember your school grades? A "1" is the best grade in the German system, but is it twice as good as a "2"? Hardly. Such grades are [http://intellspot.com/nominal-vs-ordinal-data/ ordinal numbers]. These are a system of numbers that are ranked in some sense, but the numbers per se do not necessarily reflect a numeric system. In other words, they are highly normative and contested. A "2" might be a good grade for some, and a disaster for others. Ordinal formats are often clearly defined scales that allow people to grade, evaluate or rank certain information. One of the most prominent examples is the [https://www.simplypsychology.org/likert-scale.html Likert scale] that is often used in Psychology. In this case, the scaling is often not reflected in numbers at all, but in levels such as "Strongly agree" or "disagree". Such constructed scales may make a true statistician very unhappy, since these scales are hard to analyse, yet there is hardly any alternative since it also does not make any sense to ask: "How happy are you on a scale from 1 to 100?". Therefore, ordinal scales are often relevant in order to create a scaling system that allows for wide comparability or even becomes a norm, such as school grades. My advice would be to [https://sciencing.com/advantages-disadvantages-using-ordinal-measurement-12043783.html use ordinal scales] when this is common practice in this branch of science. Read other studies in the field, and then decide. These are highly constructed scales, hence there needs to be clear reasoning on why you want to use them.
 
  
=='''Nominal data'''==
+
Discrete data also has a true zero. Take again the number of students in a statistics lecture. Although the lecture is good, for example because it includes songs from Sesame Street, there might be no students in the lecture. 0 students in a lecture: there you have your true zero.
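As a small sketch of how such count data looks in R, the following lines store made-up head counts as integers. The numbers are hypothetical and only serve as an illustration.

<syntaxhighlight lang="R" line>
# Hypothetical head counts for five lectures (the L suffix makes them integers)
students <- c(42L, 37L, 0L, 51L, 37L)

class(students)        # "integer"
is.integer(students)   # TRUE

# Counts can be summed; the 0 is a true zero (an empty lecture hall)
total_students <- sum(students)
empty_lectures <- sum(students == 0L)
</syntaxhighlight>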
[[File: Gummy Bears.jpg|thumb|right|Gummy bears are a nice example, as you can classify them by colour, which would be nominal data. But if you weigh them, you get continuous data again.]]
 
Whenever you have categorical data that represents levels that cannot be ranked, it is called [https://formpl.us/blog/nominal-data nominal data]. An example would be different ethnicities, or different types of gender. This already highlights that we are here confronted by often completely different worldviews, thus nominal data represents a stark case of a normative view of the world. Gender is a prominent example, since some people still define gender by a biological stereotype (Female/Male), which according to my worldview is clearly wrong. Nominal data formats hence demand an even clearer reflection than ordinal data, where at least you may say that a certain school grade is higher than another one. This is not the case for nominal data. Therefore, one has to be extra careful about the implications that a specific constructed scale may carry.
 
  
=='''Binary data'''==
+
==== Interval data ====
[[File: Mainzelmännchen-Ampel.png|thumb|right|Another Case of Binary Data]]
+
'''Interval data consists of measured or counted values, but it does not have a true zero.''' Also, the difference between two data points is equal no matter where on the scale you look. The best example is temperature if measured in °C. The difference between 30°C and 40°C is equal to the difference between 100°C and 110°C. However, there is no true zero to the Celsius scale: 0°C does not mean that there is no temperature. Rather, 0°C represents a specific value on the temperature scale. Therefore, you can subtract and add up temperature data, but you cannot meaningfully multiply or divide it. In addition, the lack of a true zero means that 40°C does not represent twice as much thermal energy as 20°C, although the number is twice as high.
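The point about ratios can be checked with a few lines of R. The conversion to Kelvin is an addition of this sketch and is not part of the text above; Kelvin is used here only because it is a temperature scale with a true zero.

<syntaxhighlight lang="R" line>
# Two temperatures on the Celsius scale (interval data, no true zero)
temp_c <- c(20, 40)

# Differences are meaningful on an interval scale
diff_c <- temp_c[2] - temp_c[1]

# The naive ratio suggests "twice as warm"
ratio_c <- temp_c[2] / temp_c[1]

# Converting to Kelvin, which has a true zero, gives a ratio that can be interpreted
temp_k <- temp_c + 273.15
ratio_k <- temp_k[2] / temp_k[1]
</syntaxhighlight>

The ratio on the Celsius scale is 2, while the ratio on the Kelvin scale is only about 1.07. The difference of 20 degrees is the same on both scales.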
The most reduced data format is [https://en.wikipedia.org/wiki/Binary_data binary data], which basically consists of two levels. In [https://www.youtube.com/watch?v=ewokFOSxabs computer science] this may be a simple 0 and 1, but the great breakthrough of that dataset was early on in the insurance business as well as in medicine, where dead or alive are often the most fundamental questions. Binary information is clearly simplistic, but quite often this matches with a certain view of reality. Take the example of being able to play an instrument. If somebody asks you whether you can play the piano, you will probably say yes or no. You may most likely not qualify your answer by saying "I play better than a monkey, but worse than Horowitz". Some modest folks may say "I can play a bit", or "I am not very good", or "I used to be better", but very often people answer yes or no. Hence binary data allows for a simple view of reality, and this may often match with the world how we perceive it. But be aware: Other people may have a less simple view.
 
  
=='''Choosing the right data format'''==
 
You may wonder now how to choose the right data format. The answer to that is quite simple. Any data format should be as simple as possible, and as complex as necessary. Follow Occam's razor, and you will be fine. Of course this sounds appealing, but how to know what is too simple, and what is too complex? Here, I suggest you build on the available literature. Read other publications that examined a certain phenomenon before; these papers may guide you in choosing the right scale.
 
 
=='''Overview about characteristics of some data formats'''==
 
[[File:Data Formats Table small 7.jpg|left|Data formats and characteristics]]
 
  
 +
=== Qualitative data ===
 +
'''Qualitative (categorical) data in a statistical sense is data that can be stored in labeled categories which are independent from each other.''' Such categories are typically constructed, and thus contain information that is deeply normative or designed. An example would be hair color, which can be described in terms of human colour perception, yet is often given quite different names when it comes to professional hair products. Within statistics, the categories of a scientific experiment are often constructed in a way that allows for a meaningful test of the hypothesis, and what is meaningful is then in the eye of the beholder. Different levels of fertiliser would be such an example, and the categories would often be built around previous knowledge or pre-tests. Categories are thus of particular importance when it comes to reducing the complexity of the world, as it would not be possible to test all possible levels of fertiliser in an experiment. Instead, you might go with "little", "moderate", "much" and "very much" fertiliser. Nevertheless, this demands a clear recognition that categories are constructed and deeply normative, and of how they were constructed.
  
 +
There are two types of qualitative data: ordinal data and nominal data - and then there is binary data, which is basically also nominal.
  
 +
====Ordinal data====
 +
[[File: Ordinal Data.jpg|thumb|right|'''School grades are an example of ordinal data.''']]
 +
'''Ordinal data is categorical data that can be ranked, but not calculated with, even if it is represented in numbers'''. Remember your school grades? A "1" is the best grade in the German grading system, but is it twice as good as a "2"? Hardly. Such grades are [http://intellspot.com/nominal-vs-ordinal-data/ ordinal numbers]. These are a system of numbers that are ranked in some sense, but the numbers per se do not necessarily reflect a numeric system. In other words, they are highly normative and contested. A "2" might be a good grade for some, and a disaster for others. Ordinal formats are often clearly defined scales that allow people to grade, evaluate or rank certain information. One of the most prominent examples is the [[Likert Scale|Likert scale]] that is often used in Psychology. In this case, the scaling is often not reflected in numbers at all, but in levels such as "Strongly Agree" or "Rather Disagree". Such constructed scales may make a true statistician very unhappy, since these scales are hard to analyse, yet there is hardly any alternative since it also does not make any sense to ask: "How happy are you on a scale from 1 to 100?". Therefore, ordinal scales are often relevant in order to create a scaling system that allows for wide comparability or even becomes a norm, such as school grades. My advice would be to [https://sciencing.com/advantages-disadvantages-using-ordinal-measurement-12043783.html use ordinal scales] when this is common practice in this branch of science. Read other studies in the field, and then decide. These are highly constructed scales, hence there needs to be clear reasoning on why you want to use them.
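In R, ranked categories like these can be stored as an ordered factor. The following is a minimal sketch: the answers and the exact names and order of the five levels are assumptions made for this illustration.

<syntaxhighlight lang="R" line>
# Hypothetical survey answers on a five-level Likert scale
answers <- c("Agree", "Strongly Agree", "Rather Disagree", "Agree", "Neutral")

# Level names and their order are assumptions made for this sketch
likert_levels <- c("Strongly Disagree", "Rather Disagree", "Neutral", "Agree", "Strongly Agree")
likert <- factor(answers, levels = likert_levels, ordered = TRUE)

# Ranking works: "Agree" ranks below "Strongly Agree"
agree_is_lower <- likert[1] < likert[2]
highest <- max(likert)

# Counting answers per level is fine as well
answers_per_level <- table(likert)

# Calculating does not work: mean(likert) would return NA with a warning
</syntaxhighlight>

Storing the levels as an ordered factor rather than as the numbers 1 to 5 is a deliberate choice in this sketch, because it prevents accidental calculations such as an average of the answers.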
  
 +
====Nominal data====
 +
[[File: Gummy Bears.jpg|thumb|right|'''Gummy bears are a nice example for data formats.''' You can classify them by color, which would be nominal data. But if you weigh them, you get continuous data again.]]
 +
'''Whenever you have categorical data that cannot be ranked, it is called [https://formpl.us/blog/nominal-data nominal data]'''. An example would be different ethnicities, countries of birth, or different types of gender. This already highlights that we are here confronted by often completely different worldviews, thus nominal data represents a stark case of a normative view of the world. Gender is a prominent example, since some people still define gender by a biological stereotype (Female/Male) and thus binary (see below), which according to my worldview is clearly wrong, and I see gender as nominal with more than two categories. Nominal data formats hence demand an even clearer reflection than ordinal data, where at least you may say that a certain school grade is higher than another one. This is not the case for nominal data. Therefore, one has to be extra careful about the implications that a specific constructed scale may imply.
  
 +
====Binary data====
 +
[[File: Mainzelmännchen-Ampel.png|200px|thumb|right|'''An example of binary data''']]
 +
'''Binary data is the most reduced data format, which basically consists of two levels: 1 and 0.''' It is, strictly speaking, nominal data, but nominal data that only exists in two versions which can be translated into 1 and 0: On / Off, Yes / No. In [https://www.youtube.com/watch?v=ewokFOSxabs computer science] binary data is used directly as simple 0 and 1, but the great breakthrough of this data format came early on in the insurance business as well as in medicine, where 'dead' or 'alive' are often the most fundamental questions. Binary information is clearly simplistic, but quite often this matches a certain view of reality. Take the example of being able to play an instrument. If somebody asks you whether you can play the piano, you will probably say ''yes'' or ''no''. You will most likely not qualify your answer by saying "I play better than a monkey, but worse than Horowitz". Some modest folks may say "I can play a bit", or "I am not very good", or "I used to be better", but very often people answer ''yes'' or ''no''. Hence binary data allows for a simple view of reality, and this may often match the world as we perceive it. But be aware: Other people may have a less simple view.
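The translation of two levels into 1 and 0 can be sketched in R as follows. The answers are made up for this illustration.

<syntaxhighlight lang="R" line>
# Hypothetical answers to "Can you play the piano?"
plays_piano <- c("yes", "no", "no", "yes", "no")

# Translate the two levels into 1 and 0
plays_piano_binary <- as.integer(plays_piano == "yes")

# For binary data, the sum counts the "yes" answers and the mean is their proportion
n_yes <- sum(plays_piano_binary)
share_yes <- mean(plays_piano_binary)
</syntaxhighlight>

Here two out of five people answered ''yes'', so the mean of the binary variable is 0.4.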
  
 +
== Choosing the right data format ==
 +
You may wonder now how to choose the right data format for your data gathering. The answer to that is quite simple. '''Any data format should be as simple as possible, and as complex as necessary.''' Follow Occam's razor, and you will be fine. Of course this sounds appealing, but how to know what is too simple, and what is too complex? Here, I suggest you build on the available literature. Read other publications that examined a certain phenomenon before; these papers may guide you in choosing the right scale.
  
 +
This table gives you some more information on different data formats - maybe it can help you design your study?
 +
[[File:Data Formats Table small 7.jpg|thumb|1000px|center|'''Different data formats and their characteristics.''' Source: own]]
  
  
 +
== Which simple test works for which data format? ==
 +
The following table which we compiled shows which statistical tests are useful depending on the data you have. To learn more about these tests, please refer to the entries on [[Simple Statistical Tests]], [[Regression Analysis]], [[Correlations]] and [[ANOVA]]. Note: for combinations that lead to different methods (e.g. ordinal x continuous), please refer to all mentioned approaches.<br/>
 +
[[File:Table Simple Tests.png|600px|frameless|center|'''Which simple tests do you use for which kinds of data formats?''' Source: own.]]
  
  
 +
== A word on indices ==
 +
In economics and finance, an index is a statistical measure of change in a representative group of individual data points. A good example of the application of an index that most people know is the [https://www.investopedia.com/terms/g/gdp.asp GDP], the gross domestic product of a country. Although it has been widely criticised for being too generalised and not offering enough nuance to understand the complexity of a single country, many social, economic and other indicators are correlated with the GDP.
 +
[[File:Bildschirmfoto 2020-04-11 um 11.24.41.png|thumb| Indices also appear in our everyday life, for example in charts of the latest developments on the stock market.]]
 +
In ecology, a prominent example of an index is the so-called [https://www.youtube.com/watch?v=ghhZClDRK_g Shannon Wiener index], which is an abundance-corrected measure of diversity. A prominent example from economics is the [https://www.youtube.com/watch?v=_PXFVNWINQc Dow Jones index], while the [http://hdr.undp.org/en/content/human-development-index-hdi human development index] tries to integrate information about life expectancy, education and income in order to get a general understanding of several components that characterise countries. The [https://www.investopedia.com/terms/g/gini-index.asp Gini coefficient] tries to measure inequality, surely a daring endeavour, but nevertheless quite important. In psychology, the [https://www.youtube.com/watch?v=7p2a9B35Xn0 intelligence quotient (IQ)], which is of course heavily criticised, is a well-known example of reducing many complex tests to one overall number. Indices and quotients are hence constructs that are often based on many variables and try to reduce the complexity of these diverse indicators into one set of numbers.
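To illustrate how an index reduces several numbers into one, here is a minimal sketch of the Shannon index in R, calculated as the negative sum of each species' relative abundance multiplied by its natural logarithm. The species and their abundances are hypothetical.

<syntaxhighlight lang="R" line>
# Hypothetical abundances (individuals counted per species) at one site
abundance <- c(oak = 40, beech = 30, birch = 20, pine = 10)

# Relative abundance of each species
p <- abundance / sum(abundance)

# Shannon index: H = -sum(p * log(p)), using the natural logarithm
shannon <- -sum(p * log(p))

# For comparison: four equally abundant species give the maximum value, log(4)
p_even <- rep(1 / 4, 4)
shannon_even <- -sum(p_even * log(p_even))
</syntaxhighlight>

The uneven community yields a lower value than the even one, although both contain four species. This is what "abundance-corrected" means here: the index takes into account not only how many species there are, but also how evenly the individuals are distributed among them.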
  
 +
== Further Information ==
 +
====Videos====
 +
[https://www.youtube.com/watch?v=7p2a9B35Xn0 Intelligence Quotient]: Answering the question if the IQ really measures how smart you are
  
 +
[https://www.youtube.com/watch?v=hZxnzfnt5v8 Different data formats]: An overview
  
 +
[https://www.youtube.com/watch?v=ewokFOSxabs Binary data]: How our computer works
  
 +
[https://www.youtube.com/watch?v=ghhZClDRK_g The Shannon Wiener index]: An example from ecology
  
 +
[https://www.youtube.com/watch?v=_PXFVNWINQc The Dow Jones Index]: An example from economy
  
 
+
[https://www.youtube.com/watch?v=7p2a9B35Xn0 The Intelligence Quotient]: A critical reflection
 
 
 
 
 
 
 
 
='''A word on indices in statistics'''=
 
In economics and finance, an index is a statistical measure of change in a representative group of individual data points. A good example of the application of an index that most people know is the [https://www.investopedia.com/terms/g/gdp.asp GDP], the gross domestic product of a country. Although it has been widely criticised for being too generalised and not offering enough nuance to understand the complexity of a single country, many social, economic and other indicators are correlated with the GDP. In ecology, a prominent example of an index is the so-called Shannon Wiener index, which is an abundance-corrected measure of diversity. A prominent example from economics is the Dow Jones index, while the human development index tries to integrate information about life expectancy, education and income in order to get a general understanding of several components that characterise countries. The Gini coefficient tries to measure inequality, surely a daring endeavour, but nevertheless quite important. In psychology, the intelligence quotient, which is of course heavily criticised, is a well-known example of reducing many complex tests to one overall number. Indices and quotients are hence constructs that are often based on many variables and try to reduce the complexity of these diverse indicators into one set of numbers.
 
 
 
='''Descriptive statistics'''=
 
[[File:Bildschirmfoto 2020-03-28 um 15.39.37.png|thumb|Descriptive statistics are among the most basic things you can do in statistics. Most of you probably already calculated things like the mean and the median in school.]]
 
 
 
[https://www.youtube.com/watch?v=h8EYEJ32oQ8&list=PLU5aQXLWR3_yYS0ZYRA-5g5YSSYLNZ6Mc Descriptive stats] are what most people think stats are all about. Many people believe that the simple observation of more or less, or the mere calculation of an average value, is what statistics is all about. The media often show us such descriptive statistics in whimsical bar plots or even pie charts. Hence many numbers can be compiled into descriptive statistics, which can help to gain an overview or a simple understanding of more complex numbers. It needs to be emphasised, however, that this is not an analysis. Instead, such a compilation of data can only provide an overview. In order to be versatile in this type of [http://intellspot.com/descriptive-statistics-examples/ descriptive statistics], it is important to know [https://www.investopedia.com/terms/d/descriptive_statistics.asp some basics].
 
 
 
[[File:Bildschirmfoto 2020-03-28 um 15.48.41.png|thumb|left|This graphic visualizes what mean, mode and median explain regarding a dataset.]]
 
 
 
=='''Mean'''==
 
The [https://www.youtube.com/watch?v=mk8tOD0t8M0 mean] is the average of a set of numbers. You can calculate it by adding up all the numbers and then dividing the sum by how many numbers there are in total.
 
 
 
=='''Median'''==
 
The median is the middle number in a sorted set of numbers. It can be substantially different from the mean value, for instance when you have large gaps or cover wide ranges within your data. Therefore, it is more robust against outliers.
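The robustness of the median against outliers can be illustrated with a short R sketch. The income values are made up, and the last value is deliberately chosen as an outlier.

<syntaxhighlight lang="R" line>
# Hypothetical monthly incomes; the last value is an outlier
income <- c(1800, 2100, 2300, 2500, 2700, 25000)

# The mean is pulled upwards by the single outlier
mean_income <- mean(income)

# The median stays close to the bulk of the data
median_income <- median(income)

# Without the outlier, mean and median are close to each other
mean_without_outlier <- mean(income[-6])
median_without_outlier <- median(income[-6])
</syntaxhighlight>

With the outlier, the mean is about 6067 while the median is 2400. Without it, the mean is 2280 and the median is 2300.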
 
 
 
=='''Mode'''==
 
The mode is the value that appears most often. It can be helpful in large datasets or when you have a lot of repetitions within the dataset.
 
 
 
=='''Range'''==
 
The range is simply the difference between the lowest and the highest value, and it is calculated accordingly by subtracting the lowest value from the highest value.
 
 
 
=='''Standard deviation'''==
 
The standard deviation is calculated as the square root of variance by determining the variation between each data point relative to the mean. It is a measure of how spread out your numbers are. If the data points are further from the mean, there is a higher deviation within the data set. The higher the standard deviation, the more spread out the data.
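The calculation described above can be followed step by step in R. This sketch uses a made-up vector and the sample variance, which divides by n - 1; this is also what R's own <code>var()</code> and <code>sd()</code> functions do.

<syntaxhighlight lang="R" line>
# Hypothetical data
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Deviation of each data point from the mean
deviations <- x - mean(x)

# Sample variance: sum of squared deviations divided by n - 1
variance <- sum(deviations^2) / (length(x) - 1)

# Standard deviation: square root of the variance
sd_by_hand <- sqrt(variance)

# Compare with the built-in function
matches_builtin <- isTRUE(all.equal(sd_by_hand, sd(x)))
</syntaxhighlight>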
 
 
 
[[File:Bildschirmfoto 2020-03-28 um 15.51.31.png|thumb|left|This graph shows how the standard deviation is spread from the mean.]]
 
 
 
<syntaxhighlight lang="R" line>
# Descriptive statistics using the built-in swiss dataset
swiss_data <- swiss
head(swiss_data)  # show the first rows instead of printing the whole table

# We are choosing the column Fertility for this example
# Let's begin with calculating the mean
mean(swiss_data$Fertility)

# Median
median(swiss_data$Fertility)

# Range: range() returns the minimum and the maximum,
# diff() turns them into the difference between the two
range(swiss_data$Fertility)
diff(range(swiss_data$Fertility))

# Standard deviation
sd(swiss_data$Fertility)

# Summary - includes minimum, maximum, mean, median, 1st & 3rd quartile
summary(swiss_data$Fertility)
</syntaxhighlight>
 
 
 
='''Back of the envelope statistics'''=
 
 
 
[[File:Bildschirmfoto 2020-04-08 um 11.37.25.png|thumb|left|Back of the envelope calculations give you a first impression of your idea and where it can lead.]]
 
 
 
[https://www.investopedia.com/terms/b/back-of-the-envelope-calculation.asp Back of the envelope calculations] are rough estimates that are made on a small piece of paper, hence the name. These are extremely helpful to get a quick estimate of the basic numbers for a given formula or principle, and thus enable us to do a [https://www.stlouisfed.org/on-the-economy/2020/march/back-envelope-estimates-next-quarters-unemployment-rate quick calculation] with the goal of either checking the plausibility of an assumption or deriving a simple explanation of a more complex issue. Back of the envelope calculations can for instance be helpful when you want to get a rough estimate about an idea that can be expressed in numbers. Prominent examples of back of the envelope calculations include the dominant character encoding of the World Wide Web and the development of the laser. Back of the envelope calculations are fantastic within sustainability science, I think, because they can help us to illustrate complex issues in a simpler form, and they can serve as a guideline for quick planning. Therefore, they can be used within other, more complex forms of methodological applications, such as scenario planning. By quickly calculating different scenarios we can for instance make a plausibility check and focus our approaches on the fly. I encourage you to use back of the envelope calculations in your [https://www.youtube.com/watch?v=bAU1MLRwh7c everyday life], as many of us already do. I learned to love "Tydlig", which is one of the best apps I ever used, but unfortunately I only know a version for my Apple devices. It can however be quite helpful to break numbers down into overall estimates, as the video below illustrates.
 
 
 
='''External links'''=
 
=='''Videos'''==
 
 
 
[https://www.youtube.com/watch?v=7p2a9B35Xn0 Intelligence Quotient]: Answering the question if the IQ really measures how smart you are
 
 
 
[https://www.youtube.com/watch?v=hZxnzfnt5v8 Different data formats]: An overview
 
 
 
[https://www.youtube.com/watch?v=ewokFOSxabs Binary data]: How our computer works
 
  
 
[https://www.youtube.com/watch?v=h8EYEJ32oQ8&list=PLU5aQXLWR3_yYS0ZYRA-5g5YSSYLNZ6Mc Descriptive Statistics]: A whole video series about descriptive statistics from the Khan academy
 
Line 136: Line 96:
 
[https://www.youtube.com/watch?v=bAU1MLRwh7c Back-of-envelope office space conundrum]: A real life example
 
  
=='''Articles'''==
+
====Articles====
 
 
 
[https://www.factmonster.com/math-science/weights-measures/metric-weights-and-measures Measurement]: Reflecting upon different measurement systems across the globe
 
  
Line 153: Line 112:
  
 
[https://www.investopedia.com/terms/g/gdp.asp GDP]: A detailed article
 
[http://hdr.undp.org/en/content/human-development-index-hdi The Human Development Index]: An alternative to the GDP

[https://www.investopedia.com/terms/g/gini-index.asp The GINI index]: A measure of inequality
  
 
[https://www.investopedia.com/terms/d/descriptive_statistics.asp Descriptive Statistics]: An introduction
 
  
 
[https://www.stlouisfed.org/on-the-economy/2020/march/back-envelope-estimates-next-quarters-unemployment-rate Estimates of Next Quarter’s Unemployment Rate]: An Example For Back of the Envelope Statistics
 
----
[[Category:Statistics]]

The [[Table of Contributors|author]] of this entry is Henrik von Wehrden.

Revision as of 09:41, 3 November 2023

Note: The German version of this entry can be found here: Data formats (German).

In short: This entry introduces different data formats.

Data formats in statistics

The format of your data influences everything else you do further down the road. To paraphrase a proverb, data is in a format, and the format is the data. Therefore, it is essential to know which different data formats exist, and how these may be beneficial, and where you may encounter pitfalls. For more information on different means of measurement, please refer to the 'To Rule And To Measure' entry.

The most important difference is between quantitative data and qualitative data. Quantitative data can consist of continuous, discrete or interval data, while qualitative data can be factorial -meaning in truly different categories- nominal or ordinal, with the latter two providing a link to quantitative data. However, within different areas of science, the nomenclature for data formats widely differs, and to be honest, it is a mess. Here, we try to be consistent, yet please be aware that these names are not consistent across science.


Examples of different data formats

Tracking your diet is just one of many examples how you can approach different data formats.

Imagine you want to track your diet. Many people do this today: there are diet books and advice everywhere, and much information has become available. Now you want to start and become more familiar with what you eat. How would you start? Counting calories? Differentiating between carbs, fats and greens? Maybe you just count every time you eat a pizza? Or ice cream? Or too much? There are many ways to measure your diet. And these measurements can be in different data formats.

Most data formats can be transformed into other data formats, which is often confusing for many people. For instance, nominal data can be counted: you may count the quite diverse cups of coffee you drink every day, such as a flat white, an Americano, and two espressi. These would then add up to four cups of coffee, which would represent discrete data. A different example would be a person's height, which could be represented as continuous data in meters. While this can be represented in numbers, it could also be represented in the categories "short" and "tall".
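The two transformations described above can be sketched in a few lines of Python. This is only an illustration: the coffee log, the threshold of 1.75 m and the names `coffees`, `TALL_THRESHOLD_M` and `height_category` are assumptions made up for this sketch, not part of any standard.

```python
from collections import Counter

# Hypothetical coffee log for one day (nominal data: types of coffee).
coffees = ["flat white", "americano", "espresso", "espresso"]

# Counting the nominal entries yields discrete data (whole numbers).
counts_per_type = Counter(coffees)
total_cups = len(coffees)

# Hypothetical cut-off for turning continuous height (in meters) into categories.
TALL_THRESHOLD_M = 1.75


def height_category(height_m: float) -> str:
    """Map a continuous height in meters to one of two categories."""
    return "tall" if height_m >= TALL_THRESHOLD_M else "short"


print(counts_per_type["espresso"], total_cups)
print(height_category(1.62), height_category(1.88))
```

Note that the second transformation loses information: once a height is stored as "short" or "tall", the value in meters cannot be recovered, and the result depends entirely on where the threshold was set.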

Quantitative data

Quantitative data (numeric) is data that is expressed in numbers which can be used in a numerical sense, i.e. the numbers can be used to do calculations. There are three types of quantitative data: Continuous, discrete and interval data.

Continuous data

Continuous data is numerical data that cannot be counted because it can take any value along a finite or infinite stretch of the number line. We are all familiar with continuous numbers. Much of our society is ruled by these numbers, and thus much of the data analysed in statistics is represented by continuous numbers. Since much of modern measurement is automated within a given predefined system, we often do not have to worry too much about what data looks like. Take for instance weight or size. Within Central Europe, this is clearly measured in grams or kilograms, and in centimeters or meters, respectively. However, if you move to the US, it becomes a whole different story, because of the metric system, or the lack thereof. Suddenly you are some feet tall, and you may weigh some pounds, or in the UK some "stones". Many diverse measurement systems exist, and one has to be aware of the system in which a value was measured. Hence these systems are constructs, and these constructs build on continuous numbers. Continuous numbers are widely used to express data, but we have to be aware that this then still represents normative information.

Continuous data has a true zero. A true zero is defined as a total absence of something that can be represented in numbers. Although a weight of 0 kg or a length of 0 m is abstract, the values represent the absence of weight and length, respectively.

Examples of continuous data:
- the number Pi: 3.14159265359...
- typical weight of a naked mole-rat: 30 grams
- the height of the Empire State Building: 443.2 m

Discrete data

Discrete data is numeric data that can be counted because it only exists as natural numbers including zero (0, 1, 2, 3...). An example is the number of students in a lecture, where the use of fractions is not helpful. Of course, you can think of a halved apple, but usually, if we count apples or birds or students, we consider them as complete units and stick to natural numbers. Discrete data is often also referred to as 'abundance' or 'count' data, and within the R language it is called "integer".

Discrete data also has a true zero. Take again the number of students in a statistics lecture. Although the lecture is good, for example because it includes songs from Sesame Street, there might be no students in the lecture. 0 students in a lecture: there you have your true zero.

Interval data

Interval data consists of measured or counted values, but it does not have a true zero. Also, the difference between two data points is equal no matter where on the scale you look. The best example is temperature if measured in °C. The difference between 30°C and 40°C is equal to the difference between 100°C and 110°C. However, there is no true zero to the Celsius scale: 0°C does not mean that there is no temperature. Rather, 0°C represents a specific value on the temperature scale. Therefore, you can subtract and add up temperature data, but you cannot meaningfully multiply or divide with it. In addition, the lack of a real zero means that 40°C is not twice as much energy as 20°C, although the value is twice as high.
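The point about differences and ratios can be checked with a small Python sketch. As an assumption that goes beyond the text above, it converts the values to Kelvin, a temperature scale that does have a true zero, to show what a meaningful ratio would look like; the function name `celsius_to_kelvin` is chosen only for this illustration.

```python
def celsius_to_kelvin(temp_c: float) -> float:
    """Convert degrees Celsius to Kelvin (a scale with a true zero)."""
    return temp_c + 273.15


# Differences are meaningful on an interval scale:
diff_low = 40.0 - 30.0
diff_high = 110.0 - 100.0
print(diff_low == diff_high)

# Ratios are not: the Celsius values suggest "twice as much" ...
naive_ratio = 40.0 / 20.0

# ... but on a scale with a true zero the ratio is only slightly above 1.
kelvin_ratio = celsius_to_kelvin(40.0) / celsius_to_kelvin(20.0)
print(naive_ratio, round(kelvin_ratio, 3))
```

The difference of ten degrees survives the conversion unchanged, while the apparent factor of two disappears, which is why adding and subtracting interval data is fine but multiplying and dividing it is not.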


Qualitative data

Qualitative (categorical) data in a statistical sense is data that can be stored in labeled categories which are independent from each other. Such categories are typically constructed, and thus contain information that is deeply normative or designed. An example would be hair color, which can be described in terms of human perceptions of colours, yet is often also described with different names when it comes to professional hair products. Within statistics, the categories in a scientific experiment are often constructed in a way that allows for a meaningful testing of the hypothesis, and meaningful is then in the eye of the beholder. Different levels of fertiliser would be such an example, and the categories would often be built around previous knowledge or pre-tests. Categories are thus of particular importance when it comes to the reduction of the complexity of the world, as it would not be possible to test all sorts of different levels of fertiliser in an experiment. Instead, you might go with "little", "moderate", "much" and "very much" fertiliser. Nevertheless, this demands a clear recognition that categories are constructed and deeply normative, and of how they were constructed.

There are two types of qualitative data: ordinal data and nominal data - and then there is binary data, which is basically also nominal.

Ordinal data

School grades are an example of ordinal data.

Ordinal data is categorical data that can be ranked, but not calculated with, even if it is represented in numbers. Remember your school grades? A "1" is the best grade in the German grading system, but is it twice as good as a "2"? Hardly. Such grades are ordinal numbers. These are a system of numbers that are ranked in some sense, but the numbers per se do not necessarily reflect a numeric system. In other words, they are highly normative and contested. A "2" might be a good grade for some, and a disaster for others. Ordinal formats are often clearly defined scales that allow people to grade, evaluate or rank certain information. One of the most prominent examples is the Likert scale that is often used in Psychology. In this case, the scaling is often not reflected in numbers at all, but in levels such as "Strongly Agree" or "Rather Disagree". Such constructed scales may make a true statistician very unhappy, since these scales are hard to analyse, yet there is hardly any alternative since it also does not make any sense to ask: "How happy are you on a scale from 1 to 100?". Therefore, ordinal scales are often relevant in order to create a scaling system that allows for wide comparability or even becomes a norm, such as school grades. My advice would be to use ordinal scales when this is common practice in the respective branch of science. Read other studies in the field, and then decide. These are highly constructed scales, hence there needs to be clear reasoning on why you want to use them.
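What "can be ranked, but not calculated with" means in practice can be sketched in Python. The five levels and the responses below are hypothetical (only "Strongly Agree" and "Rather Disagree" are taken from the text above), and using the median as a summary is one common choice for ranked data, not the only one.

```python
from statistics import median_low

# Hypothetical five-level scale, ordered from lowest to highest agreement.
LEVELS = [
    "Strongly Disagree",
    "Rather Disagree",
    "Neutral",
    "Rather Agree",
    "Strongly Agree",
]
# The position in the list is only a rank, not a quantity to calculate with.
RANK = {label: position for position, label in enumerate(LEVELS)}

# Hypothetical survey responses.
responses = [
    "Rather Agree",
    "Strongly Agree",
    "Neutral",
    "Rather Disagree",
    "Rather Agree",
]

# Ranking works: we can say that one answer expresses more agreement than another.
print(RANK["Strongly Agree"] > RANK["Neutral"])

# The median only relies on the order, so it is a defensible summary.
ranks = sorted(RANK[response] for response in responses)
median_response = LEVELS[median_low(ranks)]
print(median_response)
```

`median_low` is used here because it always returns one of the observed ranks, so the result can be translated back into a level of the scale. A mean of the ranks would instead assume that the steps between the levels are equally large, which is exactly what an ordinal scale does not guarantee.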

Nominal data

Gummy bears are a nice example for data formats. You can classify them by color, which would be nominal data. But if you weigh them, you get continuous data again.

Whenever you have categorical data that cannot be ranked, it is called nominal data. An example would be different ethnicities, countries of birth, or different types of gender. This already highlights that we are here confronted by often completely different worldviews, thus nominal data represents a stark case of a normative view of the world. Gender is a prominent example, since some people still define gender by a biological stereotype (Female/Male) and thus binary (see below), which according to my worldview is clearly wrong, and I see gender as nominal with more than two categories. Nominal data formats hence demand an even clearer reflection than ordinal data, where at least you may say that a certain school grade is higher than another one. This is not the case for nominal data. Therefore, one has to be extra careful about the implications that a specific constructed scale may imply.

Binary data

An example of binary data

Binary data is the most reduced data format, which basically consists of two levels: 1 and 0. It is, strictly speaking, nominal data, but nominal data that only exists in two versions which can be translated into 1 and 0: On / Off, Yes / No. In computer science binary data is used directly as simple 0 and 1, but the great breakthrough of this data format was early on in the insurance business as well as in medicine, where 'dead' or 'alive' are often the most fundamental questions. Binary information is clearly simplistic, but quite often this matches with a certain view of reality. Take the example of being able to play an instrument. If somebody asks you whether you can play the piano, you will probably say yes or no. You will most likely not qualify your answer by saying "I play better than a monkey, but worse than Horowitz". Some modest folks may say "I can play a bit", or "I am not very good", or "I used to be better", but very often people answer yes or no. Hence binary data allows for a simple view of reality, and this may often match with the world as we perceive it. But be aware: Other people may have a less simple view.

Choosing the right data format

You may wonder now how to choose the right data format for your data gathering. The answer to that is quite simple. Any data format should be as simple as possible, and as complex as necessary. Follow Occam's razor, and you will be fine. Of course this sounds appealing, but how do you know what is too simple, and what is too complex? Here, I suggest you build on the available literature. Read other publications that examined a certain phenomenon before, as these papers may guide you in choosing the right scale.

This table gives you some more information on different data formats - maybe it can help you design your study?

Different data formats and their characteristics. Source: own.


Which simple test works for which data format?

The following table which we compiled shows which statistical tests are useful depending on the data you have. To learn more about these tests, please refer to the entries on Simple Statistical Tests, Regression Analysis, Correlations and ANOVA. Note: for combinations that lead to different methods (e.g. ordinal x continuous), please refer to all mentioned approaches.

Which simple tests do you use for which kinds of data formats? Source: own.


A word on indices

In economics and finance, an index is a statistical measure of change in a representative group of individual data points. A good example of the application of an index that most people know is the GDP, the gross domestic product of a country. Although it has been widely criticised for being too generalised and not offering enough nuance to understand the complexity of a single country, many social, economic and other indicators are correlated with the GDP.

Indices also appear in our everyday life, for example as a picture of the latest developments on the stock market.

In ecology, a prominent example of an index is the so-called Shannon Wiener index, which represents an abundance-corrected measure of diversity. A prominent example from economics, again, is the Dow Jones index, while the Human Development Index tries to integrate information about life expectancy, education and income in order to get a general understanding of several components that characterise countries. The GINI coefficient tries to measure inequality, a surely daring endeavour, but nevertheless quite important. In psychology, the intelligence quotient (IQ), which is of course heavily criticised, is a known example of reducing many complex tests into one overall number. Indices and quotients are hence constructs that are often based on many variables and try to reduce the complexity of these diverse indicators into one set of numbers.
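To make tangible how an index reduces many values into one number, here is a minimal Python sketch of two of the indices mentioned above. It assumes the commonly used textbook forms: the Shannon index as the negative sum of p × ln(p) over the species proportions p, and the GINI coefficient as the mean absolute difference between all pairs of incomes divided by twice the mean income. The function names and all sample values are hypothetical.

```python
import math


def shannon_index(abundances):
    """Shannon index H = -sum(p * ln(p)) over the species proportions p."""
    total = sum(abundances)
    proportions = [count / total for count in abundances if count > 0]
    return -sum(p * math.log(p) for p in proportions)


def gini_coefficient(incomes):
    """GINI coefficient: mean absolute difference of all pairs / (2 * mean)."""
    n = len(incomes)
    mean_income = sum(incomes) / n
    abs_diff_sum = sum(abs(a - b) for a in incomes for b in incomes)
    return abs_diff_sum / (2 * n * n * mean_income)


# Hypothetical plant counts: four equally abundant species versus one dominant one.
print(shannon_index([10, 10, 10, 10]))
print(shannon_index([37, 1, 1, 1]))

# Hypothetical incomes: perfectly equal versus one person holding everything.
print(gini_coefficient([5, 5, 5, 5]))
print(gini_coefficient([0, 0, 0, 10]))
```

Both plant communities contain four species and forty individuals, yet the index is clearly lower for the one dominated by a single species. This is what "abundance-corrected" means, and it also shows how much information a single number has to leave out.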

Further Information

Videos

Intelligence Quotient: Answering the question if the IQ really measures how smart you are

Different data formats: An overview

Binary data: How our computer works

The Shannon Wiener index: An example from ecology

The Dow Jones Index: An example from economy

The Intelligence Quotient: A critical reflection

Descriptive Statistics: A whole video series about descriptive statistics from the Khan academy

Standard Deviation: A brief explanation

Mode, Median, Mean, Range & Standard Deviation: A good summary

Back-of-envelope office space conundrum: A real life example

Articles

Measurement: Reflecting upon different measurement systems across the globe

IQ: An explanation

Nominal vs. ordinal data: A comparison

Likert scale: The most popular rating scale

Ordinal data: Limitations

Nominal data: An explanation

Binary data: An explanation

GDP: A detailed article

The Human Development Index: An alternative to the GDP

The GINI index: A measure of inequality

Descriptive Statistics: An introduction

Descriptive Statistics: A detailed summary

Back of the Envelope Calculation: An explanation

Estimates of Next Quarter’s Unemployment Rate: An Example For Back of the Envelope Statistics


The author of this entry is Henrik von Wehrden.