Handling Missing Values in Python

Missing values: types and distributions

Missing values are a common problem in many datasets. They can occur for a variety of reasons, such as data not being collected or recorded accurately, data being excluded because it was deemed irrelevant, or respondents being unable or unwilling to provide answers to certain questions (Tsikriktsis 2005, 54-55).

In this text, we will explore the different types of missing values and their distributions and discuss the implications for data analysis.

Types of missing values

There are two main types of missing values: unit nonresponse and item nonresponse. Item nonresponse occurs when an individual respondent does not provide an answer to a specific question on a survey or questionnaire (Schafer and Graham 2002, 149).

Unit nonresponse occurs when an entire unit, such as a household or business, is unable to provide answers to a survey or questionnaire (ibid.).
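
The distinction can be sketched in pandas. The small survey table below is invented for illustration, and the column names are assumptions. It also assumes that nonresponding units still appear as rows; in practice they are often absent from the data altogether.

```python
import numpy as np
import pandas as pd

# Hypothetical survey: one row per respondent, one column per question.
df = pd.DataFrame(
    {
        "age": [34, np.nan, 51, 28],
        "education_years": [12, np.nan, 16, 10],
        "income": [42000, np.nan, np.nan, 31000],
    },
    index=["r1", "r2", "r3", "r4"],
)

# Unit nonresponse: every answer of the respondent is missing.
unit_nonresponse = df.isna().all(axis=1)

# Item nonresponse: some, but not all, answers are missing.
item_nonresponse = df.isna().any(axis=1) & ~unit_nonresponse

print("Unit nonresponse:", df.index[unit_nonresponse].tolist())
print("Item nonresponse:", df.index[item_nonresponse].tolist())
```

Here respondent r2 answered nothing (unit nonresponse), while r3 only skipped the income question (item nonresponse).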

Next, we will look at how missing values can be distributed and what the implications of such distributions are. Generally, both types of missing values can occur under any of these distributions.

Distributions of missing values

The distribution of missing values in a dataset can be either random or non-random. This can have a significant impact on the analysis and conclusions drawn from the data. Three common distributions of missing values are missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) (Tsikriktsis 2005, 55).

Missing completely at random

Missing completely at random (MCAR) is a type of missing data where the missing values are not related to any other variables in the dataset, and they do not follow any particular pattern or trend. In other words, the missing values are completely random, and the fact that a value is missing carries no information about the data (Tsikriktsis 2005, 55).

The implications of MCAR for data analysis are relatively straightforward. Because the missing values are completely random, they do not introduce any bias into the analysis. Therefore, it is generally safe to impute the missing values using statistical methods, such as mean imputation or multiple imputation. However, even if the missing values are MCAR, other factors can still affect the analysis. It is important to consider the number and proportion of missing values (Scheffer 2002, 156): the larger the proportion of missing values in the overall dataset, the less reliable any analysis based on it becomes. Imagine you had unit nonresponse across many different individuals, so that no variable is free of missing values. This might affect the quality of your dataset. Whether and how it does needs to be decided case by case.
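
As a minimal sketch of the MCAR case, the following simulation removes values purely at random and then applies mean imputation. The income variable, the 20% missing rate, and the random seed are assumptions chosen for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Simulated "true" incomes (illustrative numbers, not real data).
income = pd.Series(rng.normal(50_000, 10_000, size=n), name="income")

# MCAR: every value has the same 20% chance of being missing,
# independent of income and of any other variable.
is_missing = rng.random(n) < 0.2
observed = income.mask(is_missing)

# Proportion of missing values in the variable.
missing_share = observed.isna().mean()

# Mean imputation: replace each missing value with the observed mean.
imputed = observed.fillna(observed.mean())

print(f"Share missing:        {missing_share:.3f}")
print(f"True mean:            {income.mean():.0f}")
print(f"Observed mean:        {observed.mean():.0f}")
print(f"Std before / after:   {observed.std():.0f} / {imputed.std():.0f}")
```

In this setup the observed mean stays close to the true mean, which is what the absence of bias under MCAR suggests. Note that mean imputation still shrinks the standard deviation, because every imputed value sits exactly at the mean.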

Missing at random

Missing at random (MAR) is a type of missing data where the probability that a value is missing does not depend on the missing value itself, but may depend on other variables in the dataset. In other words, the missing values are not completely random, but they are not systematically related to the true values that are missing either (Tsikriktsis 2005, 55). For example, imagine you conduct a survey to analyze the relationship between education and income, and some income values are missing. If the missingness depends on education, these values are missing at random. If it depended on the respondents' actual income, they would not be.

The implications of MAR for data analysis are more complex than those for MCAR. Because the missing values are not completely random, they may introduce bias into the analysis if they are not properly accounted for. Therefore, it is important to carefully consider the underlying reasons for the missing data and take these into account when imputing the missing values. One common approach to dealing with MAR missing values is to use regression or other statistical methods to model the relationship between the missing values and the other variables in the dataset (Tsikriktsis 2005, 56). Once this relationship is understood, it can be used to correct for the bias that the missing values would otherwise introduce.
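
The education and income example can be sketched as a small simulation with regression imputation. All numbers, the logistic missingness model, and the column names are assumptions made for illustration; this is one possible approach, not the specific procedure described in the cited literature.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 10_000

# Simulated data: income rises with years of education (assumed relationship).
education = rng.normal(13, 3, size=n)
income = 20_000 + 2_500 * education + rng.normal(0, 5_000, size=n)
df = pd.DataFrame({"education": education, "income": income})

# MAR: the chance that income is missing depends only on education,
# which is fully observed, not on income itself.
p_missing = 1 / (1 + np.exp(-(education - 13)))
df["income_obs"] = df["income"].mask(rng.random(n) < p_missing)

# Fit a linear regression of income on education using complete cases.
complete = df.dropna(subset=["income_obs"])
slope, intercept = np.polyfit(complete["education"], complete["income_obs"], deg=1)

# Regression imputation: fill missing incomes with the model prediction.
predicted = intercept + slope * df["education"]
df["income_imputed"] = df["income_obs"].fillna(predicted)

true_mean = df["income"].mean()
observed_mean = df["income_obs"].mean()
imputed_mean = df["income_imputed"].mean()
print(f"True mean:     {true_mean:.0f}")
print(f"Observed mean: {observed_mean:.0f}")
print(f"Imputed mean:  {imputed_mean:.0f}")
```

Because highly educated respondents are missing more often here, the mean of the observed incomes is too low, while the imputed mean moves back towards the true mean. Deterministic regression imputation like this understates the variability of the data, which is one reason multiple imputation is often preferred.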

Missing not at random

Missing not at random (MNAR) is a type of missing data where the missingness is related to the unobserved values themselves, and possibly to the observed data as well. This means that the missing data are not random and are instead influenced by some underlying factor. This can lead to biased results if the missing data are not properly accounted for in the analysis (Tsikriktsis 2005, 55).

The implications of MNAR for data analysis are more complex than those for MCAR or MAR. Because the missing values are systematically related to the true values of the missing data, they can introduce bias into the analysis if they are not properly accounted for. In some cases, this bias may be difficult or impossible to correct, even with advanced statistical methods (Tsikriktsis 2005, 55).
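
To illustrate why MNAR is hard to correct, the following sketch reuses a simulated education and income dataset but lets the missingness depend on income itself. The numbers and the missingness model are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 10_000

education = rng.normal(13, 3, size=n)
income = 20_000 + 2_500 * education + rng.normal(0, 5_000, size=n)
df = pd.DataFrame({"education": education, "income": income})

# MNAR: the higher the true income, the more likely it is to be missing.
p_missing = 1 / (1 + np.exp(-(df["income"] - 52_500) / 5_000))
df["income_obs"] = df["income"].mask(rng.random(n) < p_missing)

# Try the same regression imputation that helps under MAR.
complete = df.dropna(subset=["income_obs"])
slope, intercept = np.polyfit(complete["education"], complete["income_obs"], deg=1)
df["income_imputed"] = df["income_obs"].fillna(intercept + slope * df["education"])

true_mean = df["income"].mean()
observed_mean = df["income_obs"].mean()
imputed_mean = df["income_imputed"].mean()
print(f"True mean:     {true_mean:.0f}")
print(f"Observed mean: {observed_mean:.0f}")
print(f"Imputed mean:  {imputed_mean:.0f}")
```

In this simulation both the observed mean and the regression-imputed mean remain below the true mean. The regression is fitted only on respondents whose income was reported, and these tend to earn less than nonrespondents with the same education, so the observed data alone cannot reveal the size of the bias.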

Determining the randomness of missing data

There are two common methods to determine the randomness of missing data. The first method involves forming two groups: one with missing data for a single variable and one with valid values for that variable. If significant differences are found between the two groups regarding their relationship to other variables of interest, it may indicate a non-random missing data process. The second method involves assessing the correlation of missing data for any pair of variables, for instance by coding each value as either missing or valid and correlating these indicators. If low correlations are found between pairs of variables, it may indicate complete randomness in the missing data (MCAR). However, if significant correlations are found between some pairs of variables, it may be necessary to assume that the data are only missing at random (MAR) (Tsikriktsis 2005, 55-56).
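
Both checks can be sketched with pandas on simulated data. The dataset, the missingness mechanisms, and the helper function `welch_t` are hypothetical and only serve as an illustration; a full analysis would typically use a proper significance test from a statistics library.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 5_000
df = pd.DataFrame(
    {
        "education": rng.normal(13, 3, size=n),
        "age": rng.normal(45, 12, size=n),
        "income": rng.normal(50_000, 10_000, size=n),
    }
)

# Assumed mechanisms: income is missing more often for higher education,
# age is missing completely at random for about 10% of respondents.
df.loc[rng.random(n) < 1 / (1 + np.exp(-(df["education"] - 13))), "income"] = np.nan
df.loc[rng.random(n) < 0.1, "age"] = np.nan


def welch_t(a, b):
    """Welch's t statistic for the difference in means of two samples."""
    return (a.mean() - b.mean()) / np.sqrt(a.var() / len(a) + b.var() / len(b))


# Method 1: compare respondents with and without a missing income.
income_missing = df["income"].isna().rename("income_missing")
group_means = df.groupby(income_missing)["education"].mean()
t_stat = welch_t(
    df.loc[income_missing, "education"], df.loc[~income_missing, "education"]
)

# Method 2: correlate the missingness indicators (1 = missing, 0 = valid).
indicators = df[["income", "age"]].isna().astype(int)
indicator_corr = indicators.corr()

print(group_means)
print(f"Welch t statistic: {t_stat:.1f}")
print(indicator_corr)
```

A large absolute t statistic for education suggests that the missing incomes are not MCAR, whereas the near-zero correlation between the two indicators gives no sign that missingness in income and age is connected.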

Overall, the treatment of missing values should be tailored to the specific distribution of missing values in the dataset. It is important to carefully consider the underlying reasons for the missing data and take appropriate steps to address them in order to ensure the accuracy and reliability of the analysis.

References

Schafer, Joseph L., and John W. Graham. "Missing data: our view of the state of the art." Psychological Methods 7, no. 2 (2002): 147.

Scheffer, Judi. "Dealing with missing data." (2002).

Tsikriktsis, Nikos. "A review of techniques for treating missing data in OM survey research." Journal of Operations Management 24, no. 1 (2005): 53-62.

The author of this entry is Finja Schneider. Edited by Milan Maushart.