Feature Crosses & Dimensionality¶
What are Feature Crosses?¶
Feature crosses combine individual features, typically via products of their values or inner products of their encodings, into synthetic features that represent nonlinear relationships. This is particularly useful when individual features don't capture complex interactions in the data.
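As a minimal illustration (the toy data and the multiply-to-cross choice are assumptions, not from the original notes), appending the product of two binary features as an extra column lets a linear model fit an XOR-style interaction it otherwise cannot represent:

import numpy as np
from sklearn.linear_model import LogisticRegression

# XOR-style labels: the outcome depends on the *combination* of the two features
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]] * 25)
y = X[:, 0] ^ X[:, 1]

X_crossed = np.column_stack([X, X[:, 0] * X[:, 1]])  # append the cross x1 * x2

print(LogisticRegression().fit(X, y).score(X, y))                  # poor: no line separates XOR
print(LogisticRegression().fit(X_crossed, y).score(X_crossed, y))  # the cross makes it linearly separable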
High-Dimensional Feature Crosses¶
The Problem¶
Using logistic regression as an example, when a dataset contains the feature vector \(X=(x_1, x_2, ..., x_k)\), a model with pairwise feature crosses takes the form

\[\hat{y} = \sigma\Big(w_0 + \sum_{i=1}^{k} w_i^\top x_i + \sum_{i=1}^{k}\sum_{j=i+1}^{k} x_i^\top w_{ij}\, x_j\Big)\]

where each categorical feature is one-hot encoded and the cross weight \(w_{ij}\) is of dimension \(n_{x_i} \cdot n_{x_j}\). When \(n_{x_i} \times n_{x_j}\) is huge (for example, the number of a website's customers times the number of goods it sells), \(w_{ij}\) alone can contain billions of parameters, making the problem extremely high-dimensional.
The Solution: Dimensionality Reduction¶
One way to get around this is to embed each feature into a k-dimensional low-dimensional vector \(x_i'\) (with \(k \ll m\) and \(k \ll n\), where \(m\) and \(n\) are the cardinalities of the two crossed features, e.g. the number of customers and the number of goods).

Now \(w_{ij} = x_i' \cdot x_j'\), the inner product of the two embeddings, and the number of parameters to tune drops from \(m \times n\) to \(m \times k + n \times k\).
This can also be viewed as matrix factorization, which has been widely used in recommendation systems.
Matrix Factorization Example¶
import numpy as np

# Original high-dimensional features
n_users = 1000
n_items = 5000
k = 50  # Low-dimensional representation

# Create low-dimensional embeddings
user_embeddings = np.random.randn(n_users, k)
item_embeddings = np.random.randn(n_items, k)

# The cross weight for a (user, item) pair is the inner product of their embeddings,
# so the full n_users x n_items interaction matrix is recovered by a matrix product
scores = user_embeddings @ item_embeddings.T

# Instead of n_users * n_items parameters,
# we now have n_users * k + n_items * k parameters
total_params = n_users * k + n_items * k
print(f"Parameters reduced from {n_users * n_items:,} to {total_params:,}")
Feature Cross Selection¶
The Challenge¶
In reality, we face a variety of high-dimensional features. Crossing every pair of features would induce:

1. Too many parameters
2. Overfitting issues
Effective Feature Combination Selection¶
We introduce feature cross selection based on decision tree models. Taking CTR (Click-Through Rate) prediction as an example:
Input features: age, gender, user type (free vs paid), searched item type (skincare vs foods)
Decision tree approach (a code sketch follows the example list below):

1. Fit a decision tree on the original input features and their labels
2. Read candidate feature crosses off the tree's root-to-leaf paths
3. Extract the meaningful feature combinations
Example feature crosses from the tree:

1. age + gender
2. age + searched item type
3. paid user + searched item type
4. paid user + age
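A minimal sketch of this idea using scikit-learn's DecisionTreeClassifier. The toy data, the feature names, and the paths_to_crosses helper are all illustrative assumptions; the point is only that the features tested along one root-to-leaf path form a candidate cross:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy CTR-style data (values randomly generated for illustration)
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(18, 60, 500),  # age
    rng.integers(0, 2, 500),    # gender
    rng.integers(0, 2, 500),    # paid_user (free vs paid)
    rng.integers(0, 2, 500),    # item_type (skincare vs foods)
])
y = rng.integers(0, 2, 500)     # click / no click
feature_names = ['age', 'gender', 'paid_user', 'item_type']

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
t = tree.tree_

def paths_to_crosses(node=0, used=()):
    """Each root-to-leaf path yields the set of features that were tested together."""
    if t.children_left[node] == -1:  # leaf node
        return [tuple(sorted(set(used)))] if len(set(used)) > 1 else []
    f = feature_names[t.feature[node]]
    return (paths_to_crosses(t.children_left[node], used + (f,)) +
            paths_to_crosses(t.children_right[node], used + (f,)))

print(set(paths_to_crosses()))  # e.g. {('age', 'gender'), ('age', 'paid_user'), ...}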
Gradient Boosting Decision Trees (GBDT)¶
How can we best construct the decision trees?

One option is Gradient Boosting Decision Trees (GBDT). The idea is to build the trees iteratively: each new tree is fit to the residual error between the current ensemble's predictions and the true values, so later trees concentrate on the mistakes the earlier trees left behind.
from sklearn.ensemble import GradientBoostingRegressor

def extract_feature_crosses(X, y, n_estimators=100):
    """
    Extract feature crosses using GBDT: the set of features tested along each
    root-to-leaf path of a tree is a candidate feature combination.
    """
    gbdt = GradientBoostingRegressor(n_estimators=n_estimators, max_depth=3)
    gbdt.fit(X, y)

    feature_crosses = set()
    # estimators_ has shape (n_estimators, 1); each entry is a fitted regression tree
    for tree in gbdt.estimators_.ravel():
        t = tree.tree_

        def walk(node, used):
            if t.children_left[node] == -1:  # leaf: record the features used on this path
                if len(used) > 1:
                    feature_crosses.add(tuple(sorted(used)))
                return
            f = t.feature[node]  # column index of the feature split on at this node
            walk(t.children_left[node], used | {f})
            walk(t.children_right[node], used | {f})

        walk(0, set())
    return feature_crosses
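If the toy X and y from the decision-tree sketch above are reused, a call could look like this (the returned tuples contain column indices rather than names):

crosses = extract_feature_crosses(X, y, n_estimators=50)
print(crosses)  # e.g. {(0, 2), (0, 1), (0, 1, 3), ...} -- each tuple is a candidate cross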
Implementation Strategies¶
1. Manual Feature Engineering¶
import pandas as pd

# Create feature crosses manually
def create_feature_crosses(df):
    # Age + Gender cross
    df['age_gender'] = df['age'].astype(str) + '_' + df['gender']
    # User type + Item type cross
    df['user_item_cross'] = df['user_type'] + '_' + df['item_type']
    return df
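A quick usage example on a hypothetical two-row DataFrame (all column values invented for illustration):

df = pd.DataFrame({
    'age': [25, 34],
    'gender': ['F', 'M'],
    'user_type': ['paid', 'free'],
    'item_type': ['skincare', 'foods'],
})
print(create_feature_crosses(df)[['age_gender', 'user_item_cross']])
#   age_gender user_item_cross
# 0       25_F   paid_skincare
# 1       34_M      free_foods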
2. Polynomial Features¶
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.random.randn(100, 4)  # e.g. 4 numeric input features

# Create pairwise interaction features (degree 2, interactions only, no x_i^2 terms)
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_poly = poly.fit_transform(X)

print(f"Original features: {X.shape[1]}")         # 4
print(f"Polynomial features: {X_poly.shape[1]}")  # 4 original + 6 pairwise crosses = 10
3. Factorization Machines¶
# Using a library like fastFM or similar (illustrative snippet; fastFM expects a
# scipy.sparse design matrix X_train and a 1-D target array y_train)
from fastFM import als

# Factorization Machine: models every pairwise interaction through low-rank embeddings;
# rank plays the same role as k in the matrix-factorization example above
fm = als.FMRegression(n_iter=1000, init_stdev=0.1, rank=8, l2_reg_w=0.1, l2_reg_V=0.1)
fm.fit(X_train, y_train)
Best Practices¶
When to Use Feature Crosses:¶
- Domain knowledge suggests interactions exist
- Linear models need to capture nonlinear relationships
- High-cardinality categorical features
- Recommendation systems and collaborative filtering
Implementation Guidelines:¶
- Start with domain knowledge - Don't cross everything
- Use tree-based methods to identify important interactions
- Monitor for overfitting - Cross-validation is crucial
- Consider computational cost - High-dimensional crosses are expensive
- Use regularization - L1/L2 regularization helps with sparse crosses (see the sketch below)
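As an illustration of the regularization point, a minimal sketch with made-up one-hot cross features (X_cross and y are random placeholders, not data from these notes):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Pretend X_cross holds one-hot-encoded feature crosses and y holds click labels
rng = np.random.default_rng(0)
X_cross = rng.integers(0, 2, size=(200, 50))
y = rng.integers(0, 2, size=200)

# An L1 penalty drives the weights of uninformative crosses to exactly zero,
# which keeps a model with many sparse crosses from overfitting
clf = LogisticRegression(penalty='l1', solver='liblinear', C=0.1).fit(X_cross, y)
print('non-zero cross weights:', np.count_nonzero(clf.coef_))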
Common Pitfalls:¶
- Curse of dimensionality - Too many crosses lead to sparse data
- Overfitting - Complex crosses without enough data
- Computational expense - High-dimensional crosses are slow
- Loss of interpretability - Complex crosses are hard to explain
Related Topics¶
- Data Types & Normalization - Preparing features for crosses
- Categorical Encoding - Encoding categorical features for crosses
- Model Evaluation - Evaluating models with feature crosses
- Regularization - Regularizing high-dimensional feature crosses