Categorical Feature Encoding¶
Understanding Categorical Features¶
Categorical features are variables that take values from a finite set of choices, such as sex (male/female) or blood type (A, B, AB, O). In raw data they usually arrive as strings.
Important Note: Some models, such as decision trees, can work with string categories directly, but models like logistic regression and SVMs require categorical features to be converted to numerical form before training.
Encoding Methods¶
1. Ordinal Encoding¶
Use Case: Data with a natural order among categories (e.g., high > middle > low)
Method: Assigns each category a numerical ID that preserves the high-to-low relationship
Example:
- High → 3
- Middle → 2
- Low → 1
Characteristics:
- Preserves ordinal relationships
- Simple and interpretable
- Assumes meaningful order exists
from sklearn.preprocessing import OrdinalEncoder
# Example data
categories = [['high'], ['low'], ['middle'], ['high'], ['low']]
# Create encoder
encoder = OrdinalEncoder(categories=[['low', 'middle', 'high']])
# Fit and transform
encoded = encoder.fit_transform(categories)
print(encoded)  # [[2.], [0.], [1.], [2.], [0.]]
2. One-Hot Encoding¶
Use Case: Features with no ordinal relationship between categories (e.g., blood type)
Method: Creates binary vectors for each category
Example for Blood Type:
- Type A → [1, 0, 0, 0]
- Type B → [0, 1, 0, 0]
- Type AB → [0, 0, 1, 0]
- Type O → [0, 0, 0, 1]
Characteristics:
- No ordinal relationship assumed
- Creates sparse vectors
- Increases dimensionality significantly
Challenges:
1. High-dimensional features can cause problems for several models:
   - K-nearest neighbors: distances between high-dimensional vectors are hard to measure meaningfully
   - Logistic regression: the number of parameters grows with the dimensionality, raising the risk of overfitting
   - Clustering: often only some dimensions carry useful signal
2. Storage: sparse vector representations can be used to save space, since most entries of a one-hot vector are zero (see the sparse-output example after the code below)
import pandas as pd
# Example data
data = pd.DataFrame({'blood_type': ['A', 'B', 'AB', 'O', 'A']})
# One-hot encoding
one_hot = pd.get_dummies(data, columns=['blood_type'])
print(one_hot)
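To illustrate the sparse-vector point above, here is a minimal sketch assuming scikit-learn's OneHotEncoder, which returns a SciPy sparse matrix by default rather than a dense array:
from sklearn.preprocessing import OneHotEncoder
# Sparse one-hot encoding: only the non-zero entries are stored
encoder = OneHotEncoder()
sparse_matrix = encoder.fit_transform(data[['blood_type']])
print(type(sparse_matrix))      # a scipy.sparse matrix
print(sparse_matrix.toarray())  # dense view for inspection
print(encoder.categories_)      # column order: ['A', 'AB', 'B', 'O']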
3. Binary Encoding¶
Use Case: Alternative to one-hot encoding for space efficiency
Method: First assigns each category an ordinal ID, then uses the binary representation of that ID as the encoding, one output column per bit (a step-by-step sketch follows the library example below)
Characteristics:
- Saves space compared to one-hot encoding
- Usually fewer dimensions (roughly log2 of the number of categories)
- Maintains some category information
import pandas as pd
import category_encoders as ce
# Example data
data = pd.DataFrame({'category': ['A', 'B', 'C', 'D', 'A']})
# Binary encoding
encoder = ce.BinaryEncoder(cols=['category'])
binary_encoded = encoder.fit_transform(data)
print(binary_encoded)
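As a minimal sketch of the mechanism behind the library call above (the ordinal IDs chosen here are illustrative; category_encoders may assign them differently):
# Assign ordinal IDs, then spell each ID out in binary, one column per bit
cats = ['A', 'B', 'C', 'D']
ids = {c: i + 1 for i, c in enumerate(cats)}   # A→1, B→2, C→3, D→4 (illustrative)
n_bits = max(ids.values()).bit_length()        # 3 bits cover IDs 1..4
for c in cats:
    print(c, format(ids[c], f'0{n_bits}b'))    # e.g. A 001, D 100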
Advanced Encoding Techniques¶
Target Encoding (Mean Encoding)¶
Method: Replaces categories with the mean of the target variable for that category
Advantages:
- Captures the relationship with the target
- Reduces dimensionality
- Handles high-cardinality features
Disadvantages:
- Risk of overfitting
- Requires careful cross-validation (see the out-of-fold sketch after the code below)
import pandas as pd
from category_encoders import TargetEncoder
# Example with target variable
X = pd.DataFrame({'category': ['A', 'B', 'A', 'C', 'B']})
y = pd.Series([1, 0, 1, 0, 1])
# Target encoding
encoder = TargetEncoder(cols=['category'])
encoded = encoder.fit_transform(X, y)
print(encoded)
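Because naive target encoding leaks the target into the features, a common remedy is out-of-fold encoding, where each row is encoded using only target statistics from the other folds. The following is a rough sketch of that idea (fold count, smoothing, and fallback value are choices, not a fixed recipe):
from sklearn.model_selection import KFold
X = pd.DataFrame({'category': ['A', 'B', 'A', 'C', 'B', 'A']})
y = pd.Series([1, 0, 1, 0, 1, 0])
global_mean = y.mean()
encoded = pd.Series(index=X.index, dtype='float64')
kf = KFold(n_splits=3, shuffle=True, random_state=0)
for train_idx, val_idx in kf.split(X):
    # Category means computed only from the other folds
    fold_means = y.iloc[train_idx].groupby(X['category'].iloc[train_idx]).mean()
    # Unseen categories fall back to the global mean
    encoded.iloc[val_idx] = (
        X['category'].iloc[val_idx].map(fold_means).fillna(global_mean).values
    )
print(encoded)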
Hash Encoding¶
Method: Uses hash functions to map categories to a fixed number of features
Advantages:
- Handles high-cardinality features
- Fixed output dimensionality
- Memory efficient
Disadvantages:
- Potential hash collisions
- Less interpretable
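A minimal sketch, assuming the category_encoders HashingEncoder (its n_components parameter fixes the output width; distinct categories may collide in the same column):
import pandas as pd
import category_encoders as ce
data = pd.DataFrame({'category': ['A', 'B', 'C', 'D', 'A']})
# Hash each category into a fixed number of output columns
encoder = ce.HashingEncoder(cols=['category'], n_components=4)
hashed = encoder.fit_transform(data)
print(hashed)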
Best Practices¶
When to Use Each Method:¶
- Ordinal Encoding:
  - Clear ordinal relationship exists
  - Categories have meaningful order
  - Tree-based models
- One-Hot Encoding:
  - No ordinal relationship
  - Small number of categories (< 10)
  - Linear models
- Binary Encoding:
  - Medium number of categories (10-100)
  - Memory constraints
  - Want to reduce dimensionality
- Target Encoding:
  - High-cardinality features
  - Clear relationship with target
  - Proper cross-validation setup
Implementation Guidelines:¶
- Handle missing values before encoding
- Fit encoders on training data only (see the sketch after this list)
- Apply the same fitted encoding to the test data
- Consider feature interactions after encoding
- Monitor for overfitting with high-cardinality features
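A minimal sketch of the fit-on-train, transform-on-test pattern, shown here with scikit-learn's OneHotEncoder; the same fit/transform discipline applies to any encoder:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
train = pd.DataFrame({'blood_type': ['A', 'B', 'AB', 'O', 'A']})
test = pd.DataFrame({'blood_type': ['O', 'A', 'B']})
# Fit the encoder on the training data only ...
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train[['blood_type']])
# ... then reuse the fitted encoder on both splits
train_encoded = encoder.transform(train[['blood_type']])
test_encoded = encoder.transform(test[['blood_type']])
print(test_encoded.toarray())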
Common Pitfalls:¶
- Data leakage: Fitting encoders on test data
- Overfitting: Using target encoding without proper validation
- Dimensionality explosion: One-hot encoding high-cardinality features
- Losing information: Using ordinal encoding for non-ordinal data
Related Topics¶
- Data Types & Normalization - Understanding different data types
- Feature Crosses - Combining encoded features
- Model Evaluation - How encoding affects model performance