Handling Categorical Data in Python
Categorical Data: Definition and Types
Categorical (qualitative) data refers to variables that can take on a limited, usually fixed, number of possible values or categories. These categories represent qualitative attributes or labels rather than measured quantities. Categorical data can be divided into two main types: nominal data, whose categories have no natural order, and ordinal data, whose categories can be ranked. For a deeper background and other data formats, please refer to the Data formats Wiki.
Nominal Data:
- Nominal data is qualitative data used to name or label variables without assigning them numeric values. It is the simplest measurement scale. Nominal variables are sorted into categories that do not overlap. Unlike other data types, nominal data cannot be ordered or measured; it has neither equal spacing between values nor a true zero.
- Examples include colors, gender, types of animals, types of fruits, etc.
Ordinal Data:
- Ordinal data involves categories with a meaningful order or ranking.
- Examples include education levels (e.g., high school, college, graduate), survey responses (e.g., strongly agree, agree, neutral, disagree, strongly disagree), or socioeconomic status categories.
- Ordinal data allows for comparisons like "greater than" or "less than" based on the inherent order, but the differences between the categories may not be consistent.
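The difference between the two types can be made explicit in pandas, which stores categorical values with an optional order. The following is a minimal sketch; the example values and variable names are invented for illustration.

```python
import pandas as pd

# Nominal: categories without any order
colors = pd.Categorical(["red", "blue", "green"], ordered=False)

# Ordinal: categories with an explicit, meaningful order
levels = ["high school", "college", "graduate"]
education = pd.Categorical(
    ["college", "high school", "graduate"],
    categories=levels,
    ordered=True,
)

# Order-based comparisons are only defined for ordered categoricals
is_above_high_school = education > "high school"
print(is_above_high_school)
print(education.min(), education.max())
```

Trying the same comparison on the unordered `colors` object raises a `TypeError`, which mirrors the fact that nominal categories cannot be ranked.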
Necessity of Categorical Data Encoding
- Most machine learning algorithms cannot directly interpret categorical data. Algorithms such as linear regression, SVMs, and neural networks are designed to handle numerical data and cannot process text labels, codes, or categories directly. Encoding converts such data into numbers that the algorithms can work with.
- Encoding allows categorical variables to be properly analyzed. With a suitable encoding, the ranking of ordinal categories is kept in the numbers, while nominal categories remain distinguishable without implying an order.
- Encoded data aligns with the required inputs for modeling. Most machine learning pipelines require numerical feature vectors as input. Encoding text, ordinal, or nominal categories into numbers prepares the categorical data in the form that most algorithms expect. This allows all useful data, including categorical predictors, to be integrated into the model.
Consequences of Neglecting Categorical Data Encoding
When categorical data is not encoded, many machine learning algorithms will not run at all. Even if they run, models built on poorly handled categorical data can make inaccurate predictions. Dropping categorical features instead of encoding them discards potentially meaningful variables, skews the analysis, prevents relationships from being quantified, and hampers the model's ability to capture nuances and correlations. Proper encoding is therefore crucial for accurate, meaningful, and high-performing machine learning models.
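As an illustration of the first point, the sketch below passes raw text labels to a scikit-learn linear regression. The toy data is hypothetical; the point is only that fitting fails because the labels cannot be converted to numbers.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: one categorical feature and a numeric target
X = np.array([["red"], ["blue"], ["green"], ["red"]], dtype=object)
y = [1.0, 2.0, 3.0, 1.5]

try:
    LinearRegression().fit(X, y)
    fit_failed = False
except ValueError as err:
    # scikit-learn cannot convert the text labels to floats
    fit_failed = True
    print("Fitting failed:", err)
```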
Ways to Handle Categorical Data
Most of the following methods use the scikit-learn (sklearn) library; the target encoding example uses pandas only.
One-Hot Encoding
One-Hot Encoding is a technique that converts categorical data into a numerical format. It creates one binary column for each category in the dataset, so that every observation is represented by a vector containing a 1 for its own category and 0s for all other categories.
    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder

    # Create a pandas DataFrame with categorical data
    df = pd.DataFrame({'color': ['red', 'blue', 'green', 'green', 'red']})

    # Show df
    print(df)

    # Create an instance of OneHotEncoder
    encoder = OneHotEncoder()

    # Fit and transform the DataFrame using the encoder
    encoded_data = encoder.fit_transform(df)

    # Convert the encoded data into a pandas DataFrame
    encoded_df = pd.DataFrame(encoded_data.toarray(), columns=encoder.get_feature_names_out())
    print(encoded_df)
df:

| | color |
|---|---|
| 0 | red |
| 1 | blue |
| 2 | green |
| 3 | green |
| 4 | red |
encoded_df:

| | color_blue | color_green | color_red |
|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.0 |
| 1 | 1.0 | 0.0 | 0.0 |
| 2 | 0.0 | 1.0 | 0.0 |
| 3 | 0.0 | 1.0 | 0.0 |
| 4 | 0.0 | 0.0 | 1.0 |
Limitations of One-Hot Encoding
- High Dimensionality: One-hot encoding results in high-dimensional vectors, especially when dealing with categorical variables with many unique values. This can lead to increased computational complexity and memory requirements.
- Lack of Semantic Similarity Information: One-hot encoding treats all categories as independent and equidistant, ignoring any inherent relationships or similarities between them. It therefore cannot capture the semantic meaning or hierarchy present in certain categorical variables.
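The dimensionality issue can be seen in a small sketch. The `user_id` column below is a hypothetical high-cardinality feature; each distinct value becomes its own output column.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical high-cardinality feature: 1,000 distinct user IDs
user_ids = [f"user_{i}" for i in range(1000)]
df = pd.DataFrame({"user_id": user_ids})

encoder = OneHotEncoder()  # returns a SciPy sparse matrix by default
encoded = encoder.fit_transform(df)

print(encoded.shape)  # one column per distinct category
print(encoded.nnz)    # stored non-zero entries: one per row
```

Because only one entry per row is non-zero, keeping the result sparse limits the memory cost, but the model still has to handle one feature per category.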
Ordinal Encoding
Ordinal encoding is a preprocessing technique that converts categorical data into numeric values while preserving the inherent order of the categories. It is useful when working with machine learning models, such as neural networks, that expect numerical input features.
Ordinal encoding provides two key benefits:
1. It encodes categorical data in a numeric form that algorithms can process.
2. It retains the ordinal information between categories, which one-hot encoding discards.
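    import numpy as np
    from sklearn.preprocessing import OrdinalEncoder

    # State the order explicitly; by default, OrdinalEncoder sorts
    # the categories alphabetically (large, medium, small)
    encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])

    sizes = ["small", "medium", "large"]
    sizes = np.array(sizes).reshape(-1, 1)  # reshape to a 2D array
    encoded = encoder.fit_transform(sizes)
    print(encoded)

Output:

    [[0.]
     [1.]
     [2.]]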
Limitations of Ordinal Encoding
- Assumption of Equal Intervals: Ordinal encoding implicitly treats the differences between adjacent categories as equal, which may not be true in all cases. It assigns numerical values based on the order alone, without considering the magnitude of the differences between the categories.
- Limited Expressiveness: It might not capture the full range of relationships between categories. Ordinal encoding does not convey information about the actual distances or similarities between categories, which can be a limitation in certain data analysis scenarios.
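The equal-interval assumption can be checked directly. In this minimal sketch, the education levels are taken from the ordinal data examples earlier in the article, and the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

levels = ["high school", "college", "graduate"]
encoder = OrdinalEncoder(categories=[levels])

X = np.array(levels).reshape(-1, 1)
codes = encoder.fit_transform(X).ravel()

print(codes)           # [0. 1. 2.]
print(np.diff(codes))  # [1. 1.]
```

Every step between adjacent categories is exactly 1, regardless of how large the real-world gap between the levels is.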
Label Encoding
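Label Encoding assigns a unique integer to each category. In scikit-learn, LabelEncoder is intended for encoding target labels (y) rather than input features, and it numbers the classes in sorted order.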
    from sklearn.preprocessing import LabelEncoder

    # Sample data
    colors = ['red', 'green', 'blue', 'green', 'red']

    # Create a label encoder
    label_encoder = LabelEncoder()

    # Fit and transform the data
    encoded_colors = label_encoder.fit_transform(colors)
    print(encoded_colors)
[2 1 0 1 2]
Limitation of Label Encoding
- Label encoding converts categorical data into numbers by assigning a unique integer (starting from 0) to each class. When the encoded column is used as a feature, this can introduce an artificial priority during model training: a label with a higher value may be treated as more important than a label with a lower value.
Example of the Limitation of Label Encoding
Consider an attribute with the classes Mexico, Paris, and Dubai. Suppose label encoding replaces Mexico with 0, Paris with 1, and Dubai with 2. A model may then interpret Dubai as having a higher priority than Mexico and Paris. However, no such priority relation exists between these cities.
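The following sketch reproduces the situation with scikit-learn and contrasts it with one-hot encoding via pandas. Note that LabelEncoder numbers the classes in sorted order, so the concrete integers differ from the mapping described above, but the implied ranking problem is the same. The `city` name is an illustrative assumption.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

cities = ["Mexico", "Paris", "Dubai", "Paris"]

# Label encoding: integer codes that a model may read as a ranking
label_codes = LabelEncoder().fit_transform(cities)
print(label_codes)  # [1 2 0 2]

# One-hot encoding: one indicator column per city, no implied order
one_hot = pd.get_dummies(pd.Series(cities, name="city"))
print(one_hot.columns.tolist())  # ['Dubai', 'Mexico', 'Paris']
```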
Target Encoding
Target Encoding, also known as mean encoding or likelihood encoding, transforms categorical variables based on the mean of the target variable for each category. This method is particularly useful when dealing with high-cardinality categorical features.
Steps in Target Encoding
1. Grouping by Category: Group the data by the categorical feature.
2. Calculating Means: For each category, calculate the mean (or another target-related statistic) of the target variable.
3. Assigning Encoded Values: Replace the categorical values with their corresponding mean values.
    import pandas as pd

    # Sample data
    data = {'Category': ['A', 'B', 'A', 'B', 'A', 'C'],
            'Target': [1, 0, 1, 1, 0, 0]}
    df = pd.DataFrame(data)

    # Calculate the mean of the target for each category
    means = df.groupby('Category')['Target'].mean()

    # Map the (rounded) means to the categories
    df['Category_Encoded'] = df['Category'].map(means.round(2))
    print(df)
| | Category | Target | Category_Encoded |
|---|---|---|---|
| 0 | A | 1 | 0.67 |
| 1 | B | 0 | 0.50 |
| 2 | A | 1 | 0.67 |
| 3 | B | 1 | 0.50 |
| 4 | A | 0 | 0.67 |
| 5 | C | 0 | 0.00 |
In this example, the 'Category' column is encoded with the mean of the 'Target' variable for each category. This encoding can be used as a feature in machine learning models.
Note: Care should be taken to avoid data leakage and overfitting when applying target encoding, especially when encoding the entire dataset without proper cross-validation.
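One common way to reduce this leakage is out-of-fold encoding: each row is encoded with category means computed only from the other folds. The sketch below is one possible implementation under stated assumptions, not a definitive recipe; the sample data, the number of folds, and the fallback to the fold's overall mean for unseen categories are illustrative choices.

```python
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical sample data
df = pd.DataFrame({
    "Category": ["A", "B", "A", "B", "A", "C", "C", "B"],
    "Target":   [1, 0, 1, 1, 0, 0, 1, 0],
})

encoded = pd.Series(index=df.index, dtype=float)

kfold = KFold(n_splits=4, shuffle=True, random_state=0)
for fit_idx, enc_idx in kfold.split(df):
    fit_part = df.iloc[fit_idx]
    # Category means come only from rows outside the fold being encoded
    fold_means = fit_part.groupby("Category")["Target"].mean()
    # Categories unseen in the fitting rows fall back to their overall mean
    fallback = fit_part["Target"].mean()
    values = df["Category"].iloc[enc_idx].map(fold_means).fillna(fallback)
    encoded.iloc[enc_idx] = values.to_numpy()

df["Category_Encoded"] = encoded
print(df)
```

Recent scikit-learn versions also provide a `TargetEncoder` in `sklearn.preprocessing` that applies a similar cross-fitting scheme internally during `fit_transform`.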
References
1. https://scikit-learn.org/stable/modules/preprocessing.html#encoding-categorical-features
2. https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02
3. https://www.analyticsvidhya.com/blog/2020/08/types-of-categorical-data-encoding/
The author of this entry is Shree Shangaavi N.